ON THE GENERAL THEORY OF MULTIPLE
CONTINGENCY WITH SPECIAL REFERENCE 
TO PARTIAL CONTINGENCY. 


By KARL PEARSON, F.R.S. 


(1) Let there be $l$ variates or characteristics $A, B, C, \ldots L$, each of these variates
or characteristics being subdivided into categories $A_1, A_2, \ldots A_\alpha$; $B_1, B_2, \ldots B_\beta$;
$C_1, C_2, \ldots C_\gamma$; $\ldots$; $L_1, L_2, \ldots L_\lambda$, where $\alpha, \beta, \gamma, \ldots \lambda$ are arbitrary numbers. Then if
$N$ be the total population, and $n_{a_1}, n_{a_2}, \ldots n_{a_\alpha}$ the number of individuals in the
A-categories; $n_{b_1}, n_{b_2}, \ldots n_{b_\beta}$ those in the B-categories and so on, we have relations

$$S(n_{a_u}) = S(n_{b_v}) = S(n_{c_w}) = \ldots = N.$$


Further, if there be no relationship whatever between the variates or characteristics,
we should anticipate that the frequency of the group $A_u, B_v, C_w, \ldots$ in a sample
of $M$ would on the average be

$$\bar m_{uvw\ldots} = M\,\frac{n_{a_u}}{N}\,\frac{n_{b_v}}{N}\,\frac{n_{c_w}}{N}\cdots$$

Actually we find in the sample $M$ the number $m_{uvw\ldots}$, and the problem arises
whether the system represented by $m_{uvw\ldots}$ is so improbable that in the selected
population $M$ the characteristics $A, B, C, \ldots L$ cannot be considered independent,
i.e. $M$ is really not a random sample from the supposed population $N$. Clearly
the answer to this problem has already been given. We have to find the value
of $\chi^2$:

$$\chi^2 = S\left\{\frac{(m_{uvw\ldots} - \bar m_{uvw\ldots})^2}{\bar m_{uvw\ldots}}\right\},$$

and apply the tables for "goodness of fit." Of course in many cases the sampled
population is not known and accordingly we can only put for the
$n_{a_u}/N,\ n_{b_v}/N,\ \ldots$ the values given by the sample itself, i.e.
$m_{a_u}/M,\ m_{b_v}/M,\ m_{c_w}/M,\ \ldots$, and test from this substitution
the degree of divergence from independence. If we take the mean value of
$\chi^2$, i.e. $\phi^2 = \chi^2/M$, $\phi^2$ is termed the mean square contingency, and
$C_2 = \sqrt{\phi^2/(1+\phi^2)}$ gives a measure of the divergence from independence*. This is a
multiple contingency coefficient.

Biometrika XI
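The calculation just described can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the memoir: the function name `multiple_contingency` and the dictionary layout of the table are assumptions, the marginal proportions are taken from the sample itself, and every marginal total is assumed to be non-zero.

```python
from itertools import product
from math import prod, sqrt


def multiple_contingency(table, shape):
    """Return (chi2, phi2, C) for a contingency table of any number of variates.

    table maps a cell index tuple (u, v, w, ...) to its observed count;
    shape gives the number of categories of each variate.
    Expected counts assume independence, with marginal proportions
    estimated from the sample (hypothetical helper for illustration).
    """
    M = sum(table.values())
    # Marginal totals for each variate, taken from the sample.
    margins = [[0] * size for size in shape]
    for cell, count in table.items():
        for axis, category in enumerate(cell):
            margins[axis][category] += count
    chi2 = 0.0
    for cell in product(*(range(size) for size in shape)):
        # Expected count under independence: M times the product of marginal proportions.
        expected = M * prod(margins[axis][cat] / M for axis, cat in enumerate(cell))
        observed = table.get(cell, 0)
        chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / M  # mean square contingency
    return chi2, phi2, sqrt(phi2 / (1 + phi2))


# Usage: a 2 x 2 table stored by cell index.
chi2, phi2, coefficient = multiple_contingency(
    {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 40}, (2, 2)
)
```

The same function handles three or more variates, since the cells are enumerated from `shape` rather than from a fixed number of loops.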


Another case not infrequently arises; the population $N$ has the characteristics
$A, B, C, \ldots$ not independent, but related; the cell $uvw\ldots$ contains $n_{uvw\ldots}$ and
the question arises how far it is safe to consider the population $M$ as a sample
of this population. In this case†

$$\bar m_{uvw\ldots} = M\,\frac{n_{uvw\ldots}}{N}.$$

In both these cases we have the relation

$$S(m_{uvw\ldots}) = M \qquad \text{(iii)},$$

and accordingly the number of cell frequencies is one more than the number of
independent variates. Thus in using the tables‡ of "goodness of fit" the $n'$ of
the argument is $\alpha\beta\gamma\ldots\lambda$, but the value of $P$, the probability, has actually been
determined from $n' - 1$.

(2) Now there are a number of cases in which not only do the cell-contents of 
the sample obey the linear relation (iii), but also other linear relations are imposed 
on the cell-contents. In the most general case we can suppose q linear relations 
between the cell-contents $m_{uvw\ldots}$, and obtain the probability $P$ corresponding to
a value of $\chi^2$ limited by these $q$ relations. The theory of sampling, when such
conditions are introduced, I term the theory of partial contingency. The reason 
for this terminology will be clearer as we develop the theory. 


As far as we are concerned at present, it is of no importance whether we are 
dealing with one or other of our two cases, i.e. whether we are questioning the 
possibility of our material being a sample from a population with independent 
A, B, C,...L characteristics, i.e. determining a coefficient of mean squared 
contingency, or are investigating the possibility of its being a probable sample 
from a population with any associations between these characteristics. We can 
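accordingly write $\chi^2$ in the form

$$\chi^2 = S\left\{\frac{(m_{uvw\ldots} - \bar m_{uvw\ldots})^2}{\bar m_{uvw\ldots}}\right\} \qquad \text{(iv)},$$

or for convenience we may even drop the descriptive subscripts and, numbering
the cells in some sequence $1, 2, 3, \ldots s, \ldots (\alpha\beta\gamma\ldots\lambda)$, write

$$\chi^2 = S_s\left\{\frac{(m_s - \bar m_s)^2}{\bar m_s}\right\} \qquad \text{(v)}.$$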


* If the characteristics may be assumed to be continuous variates, certain corrections for units of
grouping can be made. There is also a correction due to the necessarily positive value of $\phi^2$. These
corrections, which have been for some time in use, will be considered elsewhere.

† We must in this case of course actually know the value of $n_{uvw\ldots}$; it cannot be judged from the
sample.

‡ For discussion of the deduction of $P$ from $\chi^2$: see Phil. Mag. Vol. L. p. 157 (July 1900); and for
Tables: see Tables for Statisticians and Biometricians, Table XII (Cambridge University Press).
Here $\bar m_{uvw\ldots} = \bar m_s$ will be either

$$M\,\frac{n_{a_u}}{N}\,\frac{n_{b_v}}{N}\,\frac{n_{c_w}}{N}\cdots \quad\text{or}\quad M\,\frac{n_{uvw\ldots}}{N},$$

as the case may be.
Now we shall find it convenient to write

$$X_s = (m_s - \bar m_s)/\sqrt{\bar m_s} \qquad \text{(vi)},$$

and thus

$$\chi^2 = S(X_s^2) \qquad \text{(vii)}.$$
Further, if our $q$ linear equations be of type

$$h_{t1}m_1 + h_{t2}m_2 + \ldots + h_{ts}m_s + \ldots = H_t \qquad \text{(viii)},$$

where $h$ and $H$ are constants and $t$ takes every value from 1 to $q$, we can write
our conditions in the form

$$k_{t1}X_1 + k_{t2}X_2 + \ldots + k_{ts}X_s + \ldots = K_t,$$

where

$$k_{ts} = \frac{h_{ts}\sqrt{\bar m_s}}{\sqrt{h_{t1}^2\bar m_1 + h_{t2}^2\bar m_2 + \ldots + h_{ts}^2\bar m_s + \ldots}}$$

and

$$K_t = \frac{H_t - h_{t1}\bar m_1 - h_{t2}\bar m_2 - \ldots - h_{ts}\bar m_s - \ldots}{\sqrt{h_{t1}^2\bar m_1 + h_{t2}^2\bar m_2 + \ldots + h_{ts}^2\bar m_s + \ldots}} \qquad \text{(ix)}.$$

We shall speak of the first of (ix) as the prepared condition. Clearly it corresponds
to a plane in $n$-dimensional space in which the constants $k_{t1}, k_{t2}, \ldots k_{ts}, \ldots$ are the
direction-cosines and $K_t$ the perpendicular from the origin on the plane. It is
convenient to use the notation

$$k_{t1}k_{t'1} + k_{t2}k_{t'2} + \ldots + k_{ts}k_{t's} + \ldots = \cos(tt') \qquad \text{(x)},$$

for $(tt')$ is now the angle between the $t$th and $t'$th planes.
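The passage from a linear relation among the cell-contents to its prepared form is easily mechanised. The following Python is a minimal sketch under stated assumptions: the names `prepare_condition` and `cos_between` are hypothetical, a condition is given as a list of coefficients $h_{ts}$ with its constant $H_t$, and the expected cell-contents $\bar m_s$ are supplied by the caller.

```python
from math import sqrt


def prepare_condition(h_row, H, mbar):
    """Turn sum_s h_s * m_s = H into the prepared form sum_s k_s * X_s = K.

    Here X_s = (m_s - mbar_s) / sqrt(mbar_s); the k_s are direction-cosines
    (their squares sum to one) and K is the perpendicular from the origin.
    """
    norm = sqrt(sum(h * h * m for h, m in zip(h_row, mbar)))
    k = [h * sqrt(m) / norm for h, m in zip(h_row, mbar)]
    K = (H - sum(h * m for h, m in zip(h_row, mbar))) / norm
    return k, K


def cos_between(k_first, k_second):
    """Cosine of the angle between two prepared planes."""
    return sum(a * b for a, b in zip(k_first, k_second))


# Usage: four cells; the first condition fixes the total of cells 1 and 2 at 15,
# the second fixes the total of cells 3 and 4 at 17.
mbar = [4.0, 9.0, 16.0, 1.0]
k1, K1 = prepare_condition([1, 1, 0, 0], 15.0, mbar)
k2, K2 = prepare_condition([0, 0, 1, 1], 17.0, mbar)
angle_cosine = cos_between(k1, k2)  # the two conditions share no cell
```

Conditions that involve disjoint sets of cells, such as two different marginal totals of the same face, give a zero cosine.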
Assuming that the frequency surface with which we have to deal may be taken
as

$$z = z_0\,e^{-\frac12\chi^2} \qquad \text{(xi)},$$

we may suppose before applying equations of condition (ix) that $K_1, K_2, \ldots K_t, \ldots K_q$
are variates and that we eliminate $X_1, X_2, \ldots X_q$, expressing our $\chi^2$ in terms of
$K_1, K_2, \ldots K_q, X_{q+1}, X_{q+2}, \ldots X_n$. We shall have then

$$z = z_0 \exp\{-\tfrac12\,(\text{quadratic function of these } n \text{ new variables})\}.$$

We now proceed to put $K_1, K_2, \ldots K_q$ constant, but leave the other $n-q$
quantities to vary; we are therefore seeking the value of $\chi^2$ for certain variates
constant. This is the essence of partial contingency, and the analogy in contingency
to partial correlation.


(3) As a rule in partial contingency we do not seek to discuss $\chi^2$ when single
cells of our multiple contingency solid are constant in frequency, although our
theory covers that case. What we require usually is the value of $\chi^2$ when we make
the contents of certain marginal total cells constant. For example, let us consider 
a population of uniform sex and let there be three characteristics: (i) frequency
of age groups, (ii) frequency of occupational categories and (iii) frequency of survival
and of death, the latter classified by various special disease classes, the whole,
say, representing the returns for one year of a large area or country. We then
require to determine, whether the like contingency solid for a sub-district, or for
another population entirely, may be considered as significantly different from the
above general population, i.e. we require to find the probability of its being a
random sample of this general population. Now we may do this in the most
universal manner, by assuming that not only survivals and deaths, but that age
groups and occupations all have frequencies, which are random samples of the above 
general population. But this is not very often what we require; we admit that 
the age distribution is differentiated, we admit that the occupational frequencies 
are peculiar to the locality, and we ask whether, notwithstanding these differences, 
the death distribution is to be considered as a random sample. 

In other words, we do not only fix the size of our sample M; we fix all one 
face—that of age groups and occupational categories of our contingency solid— 
and ask what is the distribution of samples of M taken from this solid, subject 
to the linear conditions that the totals of age-occupational categories are constant. 
For example, if $A$ be age and $B$ occupation, we make $n_{a_ub_v}$ constant for all values
of $u$ and $v$, but

$$n_{a_ub_v} = n_{a_ub_vc_1} + n_{a_ub_vc_2} + n_{a_ub_vc_3} + \ldots,$$

where $n_{a_ub_vc_w}$ is the frequency of the $uvw$ cell, $c$ denoting the category $C$ of type of
death and survival.


Now clearly in making this investigation we shall be studying the mean square
contingency and the resulting probability of a partial sample, that is, a sample of survival
and death-type in a population of constant age groups and occupational classes.
Again, we might treat a population as a sample with only constant age groups or
only constant occupational frequencies and again investigate its probability as
a sample with regard to deaths and occupations or with regard to deaths and age
groups respectively. These would be partial contingencies of the order $\alpha$ or $\beta$,
while the previous partial contingency was of the order $\alpha + \beta$, $\alpha$ being the number
of categories in $A$ (i.e. age groups) and $\beta$ being the number in $B$ (i.e. mortality
and survival classes).

(4) Now the value of $\chi^2$ given above, and of the frequency surface (xi), was
discussed by me in the year 1900* from the general normal frequency surface by
evaluation of determinants. The demonstration depends on two hypotheses:


(i) The approach of the binomial $M(p+q)^n$ to the normal curve.


We know that this is true if $n$ be considerable and neither $p$ nor $q$ very small.


* Phil. Mag. Vol. L. p. 157, 1900. I have recently given a more elementary proof with the probable
error of $P$: see Phil. Mag. April, 1916.
The hypothesis is justified therefore if no cell be taken so small that its contents
are very small compared with the size of the sample. 


(ii) That the sampling takes place out of a population indefinitely greater 
than the sample. If this be not true, then the distribution of frequency of any
given cell for a series of samples follows not a binomial but a hypergeometrical 
series. The necessary modifications in the formulae are not very substantial and 
have been discussed elsewhere*. 


Supposing the above two conditions to be fulfilled, then in true random
samples the mean of a cell frequency will be

$$\bar m_s = M\,\frac{n_s}{N} \qquad \text{(xii)},$$

where $M$ is size of sample, $n_s$ the contents of the $s$th cell in the sampled population
$N$, and condition (i) amounts to saying that no cell is to be chosen so that $n_s/N$
is indefinitely small.


Further, the frequency of the $s$th cell will follow the binomial

$$\left\{\frac{n_s}{N} + \left(1 - \frac{n_s}{N}\right)\right\}^M,$$

and thus have a standard deviation given by

$$\sigma_s = \sqrt{M\,\frac{n_s}{N}\left(1 - \frac{n_s}{N}\right)} \qquad \text{(xiii)}.$$

Lastly the correlation $r_{ss'}$ between deviations in the $s$th and $s'$th cells is given by

$$\sigma_s\sigma_{s'}r_{ss'} = -M\,\frac{n_s}{N}\,\frac{n_{s'}}{N} \qquad \text{(xiv)}.$$


(5) The following deduction of the value of $\chi^2$ is a variant from my Phil. Mag.
proof. I owe the suggestion of it to Mr H. E. Soper, although I have deviated somewhat
from his track. Let an indefinitely large population $N$ consist of the classes
$C_0, C_1, \ldots C_l$ in the quantities $n_0, n_1, n_2, \ldots n_l$ respectively. Then $p_s = n_s/N$ =
chance of drawing a member of the class $C_s$, and the standard deviation of the
distribution of frequency in samples of $M$ drawn from the population will in this
class $C_s$ be as above

$$\sigma_s = \sqrt{Mp_s(1-p_s)} \qquad \text{(xiii)}^{bis}.$$

Further, the mean of samples for this class will be $Mp_s$ by (xii).


In the next place the correlation between deviations from the means in classes
$C_s$ and $C_{s'}$ will be in our present notation

$$\sigma_s\sigma_{s'}r_{ss'} = -Mp_sp_{s'} \qquad \text{(xiv)}^{bis},$$

or by (xiii)$^{bis}$

$$r_{ss'} = -\frac{\sqrt{p_sp_{s'}}}{\sqrt{1-p_s}\,\sqrt{1-p_{s'}}} \qquad \text{(xv)}.$$
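For a small sample the standard deviation and correlation stated above can be checked by direct enumeration of the multinomial. The sketch below is illustrative only and restricts itself, as an assumption made for brevity, to three classes; the names `exact_moments` and `correlation_xv` are hypothetical.

```python
from math import factorial, sqrt


def exact_moments(M, p):
    """Enumerate every sample of size M from three classes with chances
    p = (p0, p1, p2); return the exact means, variances and the covariance
    of the frequencies of classes 1 and 2."""
    mean = [0.0, 0.0, 0.0]
    second = [0.0, 0.0, 0.0]
    cross12 = 0.0
    for u0 in range(M + 1):
        for u1 in range(M - u0 + 1):
            u = (u0, u1, M - u0 - u1)
            # Multinomial coefficient M! / (u0! u1! u2!).
            weight = factorial(M) / (factorial(u[0]) * factorial(u[1]) * factorial(u[2]))
            prob = weight * p[0] ** u[0] * p[1] ** u[1] * p[2] ** u[2]
            for i in range(3):
                mean[i] += prob * u[i]
                second[i] += prob * u[i] ** 2
            cross12 += prob * u[1] * u[2]
    variance = [second[i] - mean[i] ** 2 for i in range(3)]
    covariance12 = cross12 - mean[1] * mean[2]
    return mean, variance, covariance12


def correlation_xv(p_s, p_t):
    """Correlation of deviations in two classes, as in the closed form above."""
    return -sqrt(p_s * p_t) / (sqrt(1 - p_s) * sqrt(1 - p_t))


# Usage: compare the enumerated correlation with the closed form.
mean, variance, covariance12 = exact_moments(6, (0.5, 0.3, 0.2))
enumerated_r = covariance12 / sqrt(variance[1] * variance[2])
closed_form_r = correlation_xv(0.3, 0.2)
```

The enumeration grows rapidly with $M$ and is intended only as a check on the formulae, not as a practical method.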


* Phil. Mag. p. 239, 1899 and Biometrika, Vol. v. p. 174 
The distribution of frequency of the different classes in these samples of $M$ will
be given by the terms of the multinomial

$$(p_0C_0 + p_1C_1 + p_2C_2 + \ldots + p_lC_l)^M \qquad \text{(xvi)},$$

the general term being

$$\frac{M!}{u_0!\,u_1!\,u_2!\ldots u_l!}\;p_0^{u_0}p_1^{u_1}p_2^{u_2}\ldots p_l^{u_l}\;C_0^{u_0}C_1^{u_1}\ldots C_l^{u_l} \qquad \text{(xvii)},$$

where $C_0, C_1, \ldots C_l$ are only logical symbols to denote that this general term is
the frequency of the group, where the class $C_s$ occurs $u_s$ times in the sample.
Clearly $u_0 + u_1 + \ldots + u_l = M$.
But (xvi) may be put into the form of a binomial, it is

$$S_{m=0}^{m=M}\;\frac{M!}{m!\,(M-m)!}\,(p_0C_0)^{M-m}\,(p_1C_1 + p_2C_2 + \ldots + p_lC_l)^m \qquad \text{(xviii)}.$$


Let $(p_1 + p_2 + \ldots + p_l) = \lambda$, then the $m$th term of the above series may be
read as

$$\frac{M!}{m!\,(M-m)!}\,(p_0C_0)^{M-m}\,\lambda^m\,(p_1'C_1 + p_2'C_2 + \ldots + p_l'C_l)^m \qquad \text{(xix)},$$

where $p_1' + p_2' + \ldots + p_l' = 1$,

and

$$\frac{p_1}{p_1'} = \frac{p_2}{p_2'} = \frac{p_3}{p_3'} = \ldots = \frac{p_l}{p_l'} = \lambda = 1 - p_0 \qquad \text{(xx)}.$$


Now it is clear that the factor

$$(p_1'C_1 + p_2'C_2 + \ldots + p_l'C_l)^m$$

gives the frequency distribution of samples of $m$ drawn from a population of
indefinitely large size of which the proportions of the classes $C_1, C_2, \ldots C_l$ are
$p_1', p_2', \ldots p_l'$ and in which no class $C_0$ occurs. But by (xx) these proportions are
the same as in the original population which contains $C_0$.


Hence if we take samples of $M$ from an indefinitely large population with
classes $C_0, C_1, C_2, \ldots C_l$, those that contain $m$ of the classes $C_1, C_2, \ldots C_l$ will be
distributed in the same proportions as if we had extracted $m$ from an indefinitely
large population consisting only of those $l$ classes in the same proportions.
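This statement admits a direct numerical check. The Python below is a sketch for the simplest case of one excluded class $C_0$ and two remaining classes, which is an assumption made for brevity; the function names are hypothetical. It compares the distribution of the samples of $M$ that contain exactly $m$ members outside $C_0$ with the distribution of samples of $m$ from the reduced population.

```python
from math import comb


def conditional_on_c0(M, p, m):
    """Distribution of u1 (with u2 = m - u1) among samples of M that contain
    exactly M - m members of class C0; p = (p0, p1, p2)."""
    p0, p1, p2 = p
    joint = {}
    for u1 in range(m + 1):
        u2 = m - u1
        # General multinomial term with u0 = M - m:
        # M! / ((M - m)! u1! u2!) = comb(M, m) * comb(m, u1).
        joint[u1] = comb(M, m) * comb(m, u1) * p0 ** (M - m) * p1 ** u1 * p2 ** u2
    total = sum(joint.values())
    return {u1: value / total for u1, value in joint.items()}


def reduced_population(m, p):
    """Distribution of u1 in samples of m from the population without C0,
    the remaining proportions being rescaled to sum to one."""
    _, p1, p2 = p
    lam = p1 + p2
    p1_reduced, p2_reduced = p1 / lam, p2 / lam
    return {u1: comb(m, u1) * p1_reduced ** u1 * p2_reduced ** (m - u1) for u1 in range(m + 1)}


# Usage: the two distributions agree term by term.
left = conditional_on_c0(10, (0.7, 0.2, 0.1), 4)
right = reduced_population(4, (0.7, 0.2, 0.1))
```

The agreement is exact for the multinomial and holds for any value of $p_0$; the approximations that follow in the text are needed only for the unconditional means, deviations and correlations.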


Now thus far the nature of the class $C_0$ is at our choice. In the original population
$N$ it appears with the total frequency $n_0$. Let $n_0 + n' = N$, and suppose
$n_0$ is indefinitely greater than $n'$, then $p_0$ will be indefinitely greater than
$p_1, p_2, \ldots p_l$. It follows that $p_s$ if $s$ be not zero is very small compared to
unity, because

$$p_0 + p_1 + p_2 + \ldots + p_l = 1.$$

Hence in such a system from (xiii)$^{bis}$, if $s$ be not zero,

$$\sigma_s = \sqrt{Mp_s} \qquad \text{(xxi)},$$

and from (xv) to the same degree of approximation $r_{ss'} = 0$. That is to say, if
$p_0$ be large in taking samples of size $M$ from an indefinitely large population, there
will be no correlation in deviations in the frequency of the classes $C_1, \ldots C_l$.
On the other hand $r_{s0}$ is not zero, but equals by (xv)

$$-\sqrt{p_sp_0}/\sqrt{1-p_0} = -\sqrt{p_s}/\sqrt{1-p_0}$$

to the same degree of approximation. Hence by (xx)

$$r_{s0} = -\sqrt{p_s'} \qquad \text{(xxii)}.$$


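But if we form the partial correlation of deviations in classes $C_s$ and $C_{s'}$ for constant
frequency of number in $C_0$, we have

$${}_0r_{ss'} = (r_{ss'} - r_{s0}r_{s'0})/\sqrt{(1-r_{s0}^2)(1-r_{s'0}^2)}
= (0 - \sqrt{p_s'}\sqrt{p_{s'}'})/\sqrt{(1-p_s')(1-p_{s'}')},$$

or

$${}_0r_{ss'} = -\frac{\sqrt{p_s'p_{s'}'}}{\sqrt{1-p_s'}\,\sqrt{1-p_{s'}'}} \qquad \text{(xxiii)},$$

which agrees with (xv), the classes being now reduced by unity. Further, the
reduced standard deviation must now be

$${}_0\sigma_s = \sigma_s\sqrt{1-r_{0s}^2} = \sigma_s\sqrt{1-p_s'} = \sqrt{Mp_s}\,\sqrt{1-p_s'} \qquad \text{(xxiv)}.$$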


Now take the mean value ${}_0\bar m_s$ of $x_s$, the frequency in class $C_s$ for a constant $x_0$ or
for constant frequency in class $C_0$; in our case, if the sample is to be $m$ this will
be $M - m$, we have

$${}_0\bar m_s - \bar m_s = r_{s0}\,\frac{\sigma_s}{\sigma_0}\,(x_0 - \bar m_0),$$

or

$${}_0\bar m_s = \frac{m\,p_s}{\lambda} = mp_s' \qquad \text{(xxv)}.$$

The partial values ${}_0\bar m_s$, ${}_0r_{ss'}$ of the means and correlations of classes for constant
number in class $C_0$ are given by (xxiii) and (xxv), and are what we might anticipate.
But (xxiv) should be $\sqrt{mp_s'(1-p_s')}$. It is accordingly needful to take

$$\lambda = m/M, \quad\text{and}\quad p_0 = 1 - m/M \qquad \text{(xxvi)}.$$


These results have been reached on the assumption that $p_0$ is very large as
compared with $p_1, p_2, \ldots p_l$. It follows accordingly that the sample $M$ must be
large as compared with $m$, and further the sum of the classes $C_1, C_2, \ldots C_l$ must
be to that of the class $C_0$ in the total population in the ratio of the partial sample
$m$ to the total sample $M$. Without this condition it is not possible to replace
$Mp_s$ by $mp_s'$. Assuming these conditions to be satisfied, then samples of the
size $m$ in classes $C_1, C_2, \ldots C_l$ picked out of very large samples of $M$ will reproduce
the same distribution of frequencies in those classes as samples of $m$ picked out of
an indefinitely large population with the same relative frequencies in those classes.


But in the case of samples of $M$, the deviations have their correlations zero
for the classes $C_1, C_2, \ldots C_l$, or they will be approximately distributed by the
product of their independent probabilities. The standard deviation being $\sqrt{mp_s'}$
and the mean $mp_s'$, we see that the frequency distribution would really follow
a Poisson's binomial limit, but as shown by L. Whitaker* this binomial limit is
approximately Gaussian with fairly low values of $mp_s'$; see the Diagrams for
$mp_s' = 10$ and $= 30$ in the plate of her memoir. We may accordingly
take the distribution of the frequency in the $s$th class or cell to be given by

$$z_s = \frac{1}{\sqrt{2\pi}\,\sqrt{\bar m_s}}\;e^{-\frac12\frac{(m_s-\bar m_s)^2}{\bar m_s}} \qquad \text{(xxvii)}$$

and the general distribution to be

$$z = z_0\,e^{-\frac12\chi^2} \qquad \text{(xxviii)},$$

where $\chi^2 = S\left\{\dfrac{(m_s-\bar m_s)^2}{\bar m_s}\right\}$.
Here, if the size of the sample only be fixed, we shall have

$$S(m_s) = m = m\,S(p_s') = S(\bar m_s),$$

or $S(m_s - \bar m_s) = 0$.

If we take $X_s = \dfrac{m_s - \bar m_s}{\sqrt{\bar m_s}}$,
we have:

$$\chi^2 = X_1^2 + X_2^2 + \ldots + X_l^2 \qquad \text{(xxix)}$$

subject to the condition:

$$\sqrt{\bar m_1}\,X_1 + \sqrt{\bar m_2}\,X_2 + \ldots + \sqrt{\bar m_l}\,X_l = 0 \qquad \text{(xxx)}.$$


* Biometrika, Vol. x. p. 36. If $\bar m_s = mp_s'$ be the mean, $\beta_1 = 1/\bar m_s = \beta_2 - 3$, so that $\beta_1 = 0.03$, $\beta_2 = 3.03$
already for $\bar m_s = 33$.

It is clear that $\chi^2$ equal to a constant gives a sphere in $l$-fold space, and that
(xxx) is a plane passing through its centre, and therefore cutting the sphere in $l$-fold
space in a sphere of the same radius in $(l-1)$-fold space. Hence if we desire to
find the volume of the frequency surface (xxviii) which lies outside a value of
$\chi = \chi_0$ subject to the condition (xxx), all we have to do is to transfer to polar
coordinates and integrate the value of $z$ for the $(l-1)$-fold surface beyond the
value $\chi_0$*. Accordingly

$$P = \frac{\int_{\chi_0}^{\infty} e^{-\frac12\chi^2}\,\chi^{l-2}\,d\chi}{\int_0^{\infty} e^{-\frac12\chi^2}\,\chi^{l-2}\,d\chi} \qquad \text{(xxxi)}$$

is the chance of a sample occurring with as great or greater deviation as the
$\chi_0$ sample from the general population. This is the expression from which the
Tables of "Goodness of Fit" were calculated, the arguments being $\chi_0^2$ and $l$, i.e. the
value of $\chi^2$ for the sample and the total number of categories in the sample.
Thus far there is only difference of method of deduction, not of results.
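Where the printed Tables are not to hand, the probability can be evaluated directly. The following Python is a minimal sketch and not the method by which the Tables were computed: it uses the standard recurrence that raises the number of independent variables by two at each step, starting from the closed forms for one and two variables. The names `chi2_tail` and `goodness_of_fit_P` are hypothetical.

```python
from math import erfc, exp, gamma, sqrt


def chi2_tail(chi2, dof):
    """Chance that chi-squared with `dof` independent variables exceeds `chi2`.

    Uses Q(dof + 2) = Q(dof) + (chi2/2)**(dof/2) * exp(-chi2/2) / Gamma(dof/2 + 1),
    starting from Q(1) = erfc(sqrt(chi2/2)) and Q(2) = exp(-chi2/2).
    """
    if dof < 1 or chi2 < 0:
        raise ValueError("need dof >= 1 and chi2 >= 0")
    half = chi2 / 2.0
    if dof % 2:
        tail, nu = erfc(sqrt(half)), 1
    else:
        tail, nu = exp(-half), 2
    while nu < dof:
        tail += half ** (nu / 2.0) * exp(-half) / gamma(nu / 2.0 + 1.0)
        nu += 2
    return tail


def goodness_of_fit_P(chi2, n_prime):
    """P for the table argument n' (number of categories), i.e. n' - 1 free variables."""
    return chi2_tail(chi2, n_prime - 1)


# Usage: a sample with chi-squared 7.8 over l = 4 categories.
P = goodness_of_fit_P(7.8, 4)
```

With $l$ categories and the single condition that the sample size is fixed, the integrand $\chi^{l-2}$ corresponds to $l - 1$ independent variables, which is why `goodness_of_fit_P` subtracts one from its argument.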


(6) We now propose to replace condition (xxx) by a series of $q$ linear equations
of form (viii). These in the case of sampling will, if the size of the sample be
fixed, either directly or indirectly involve (xxx).


The type of these equations in their prepared form is

$$k_{t1}X_1 + k_{t2}X_2 + \ldots + k_{ts}X_s + \ldots = K_t.$$

Each such plane will intersect

$$\chi^2 = X_1^2 + X_2^2 + \ldots + X_s^2 + \ldots$$

in a sphere of lower order. For example, if there be $n$ variates $X$, the first plane
gives a sphere of the $(n-1)$th order, this will be intersected by the second plane
in a sphere of the $(n-2)$th order, so that ultimately we find ourselves reduced to
a sphere of the $(n-q)$th order, by the intersection of the $q$th plane. If $K_1$,
$K_2, \ldots K_q$ were all zero, the radius of the sphere of the $(n-q)$th order would be
the same, i.e. $\chi$, as the radius of the sphere of the $n$th order. But since these
quantities are usually not zero we have to determine the radius of this sphere.
The centre of this sphere must lie in every one of the $q$ planes of the $n$th order,
and accordingly on the plane of order $n - (q-1)$ in which they intersect. But
the centre of the sphere of $n - q$ order is where the perpendicular, $K$, from the
origin meets this plane of the $\{n - (q-1)\}$th order, and the radius $\chi'$ of the sphere
of the $(n-q)$th order is given by $\chi'^2 = \chi^2 - K^2$. To determine $\chi'$ we must find $K$.


Now $K$ will be the minimum distance from the origin to the plane of the
$\{n - (q-1)\}$th order in which the $q$ planes intersect. In other words to find $K$ we
must make

$$K^2 = X_1^2 + X_2^2 + \ldots + X_s^2 + \ldots + X_n^2$$

a minimum subject to the $q$ conditions of type

$$k_{t1}X_1 + k_{t2}X_2 + \ldots + k_{ts}X_s + \ldots + k_{tn}X_n = K_t,$$

where $k_{t1}^2 + k_{t2}^2 + \ldots + k_{ts}^2 + \ldots + k_{tn}^2 = 1$.
Using the method of indeterminate multipliers we find $n$ equations of the form

$$X_s + \lambda_1k_{1s} + \lambda_2k_{2s} + \ldots + \lambda_qk_{qs} = 0.$$


* Phil. Mag. Vol. L. p. 158.
Multiply by $X_s$ and add the series, and we find

$$K^2 = -(\lambda_1K_1 + \lambda_2K_2 + \ldots + \lambda_qK_q) \qquad \text{(xxxii)}.$$

Multiply by $k_{1s}$ and add the series, and we find, by aid of (x),

$$-K_1 = \lambda_1 + \lambda_2\cos(12) + \lambda_3\cos(13) + \ldots + \lambda_q\cos(1q).$$

Similarly:

$$-K_2 = \lambda_1\cos(21) + \lambda_2 + \lambda_3\cos(23) + \ldots + \lambda_q\cos(2q),$$

$$\ldots\ldots$$

$$-K_q = \lambda_1\cos(q1) + \lambda_2\cos(q2) + \lambda_3\cos(q3) + \ldots + \lambda_q.$$

These are $q$ equations to find $\lambda_1, \lambda_2, \ldots \lambda_q$, and we can then substitute in (xxxii)
to find the required $K^2$.


Now consider the determinant

$$R = \begin{vmatrix}
1 & K_1 & K_2 & K_3 & \ldots & K_q \\
K_1 & 1 & \cos(12) & \cos(13) & \ldots & \cos(1q) \\
K_2 & \cos(21) & 1 & \cos(23) & \ldots & \cos(2q) \\
K_3 & \cos(31) & \cos(32) & 1 & \ldots & \cos(3q) \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
K_q & \cos(q1) & \cos(q2) & \cos(q3) & \ldots & 1
\end{vmatrix} \qquad \text{(xxxiii)},$$

and let us call the first row and the first column the 0 row and 0 column, then
clearly

$$\lambda_t = R_{0t}/R_{00},$$

where $R_{st}$ is the minor of the $s$th row and $t$th column, and

$$-K^2 = \frac{K_1R_{01} + K_2R_{02} + \ldots + K_tR_{0t} + \ldots + K_qR_{0q}}{R_{00}},$$

$$1 - K^2 = R/R_{00},$$

or

$$K^2 = 1 - R/R_{00} \qquad \text{(xxxiv)}.$$

If we call $\Delta$ the minor $R_{00}$, we have:

$$\frac{R_{0t}}{R_{00}} = -\frac{K_1\Delta_{1t} + K_2\Delta_{2t} + \ldots + K_q\Delta_{qt}}{\Delta}.$$

Thus

$$K^2 = S\left(K_t^2\,\frac{\Delta_{tt}}{\Delta}\right) + 2S\left(K_tK_{t'}\,\frac{\Delta_{tt'}}{\Delta}\right) \qquad \text{(xxxv)}.$$


From this we deduce that the probability of a sample which gives $\chi^2 = \chi_0^2$
with $q$ linear conditions must be obtained from

$$P = \frac{\int_{\sqrt{\chi_0^2 - K^2}}^{\infty} e^{-\frac12\chi^2}\,\chi^{n-q-1}\,d\chi}{\int_0^{\infty} e^{-\frac12\chi^2}\,\chi^{n-q-1}\,d\chi}.$$

We must therefore in order to find $P$ enter the Tables of "Goodness of Fit"
with $\chi^2 = \chi_0^2 - K^2$ and with $n' = n - q + 1$*.


The reader who is familiar with the theory of the multiple total and partial
correlation coefficients will note how closely analogous the formulae (xxxiii) to
(xxxv) are to results in that theory. The fundamental determinant of the
correlation may be made to agree with $\Delta$, if we merely write $r_{ss'} = \cos(ss')$.
In fact both theories really reduce to the discussion of the formulae of spherical
trigonometry in multiple space.
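The determination of $K^2$ from the prepared conditions amounts to solving the $q$ linear equations for the multipliers. The Python below is a sketch under stated assumptions: each prepared condition is supplied as a row of direction-cosines with its perpendicular, the conditions are not redundant (so the matrix of cosines is not singular), and the helper names are hypothetical. The elimination routine is ordinary Gaussian elimination, substituted here for the evaluation of minors.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def perpendicular_squared(k_rows, K):
    """Squared distance from the origin to the intersection of the prepared planes."""
    cosines = [[sum(a * b for a, b in zip(ki, kj)) for kj in k_rows] for ki in k_rows]
    # Multipliers satisfy: cosines * lam = -K.
    lam = solve_linear(cosines, [-value for value in K])
    return -sum(l * value for l, value in zip(lam, K))


def table_arguments(chi2_0, k_rows, K):
    """Arguments with which to enter the tables: (chi2_0 - K^2, n' = n - q + 1)."""
    n, q = len(k_rows[0]), len(k_rows)
    return chi2_0 - perpendicular_squared(k_rows, K), n - q + 1


# Usage: three variates, two mutually perpendicular conditions.
reduced_chi2, n_prime = table_arguments(5.0, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.3, 0.4])
```

When all the cosines vanish the result reduces to the sum of the squares of the separate perpendiculars, as in the illustration of the next section.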


(7) A simple illustration of the above formulae may be taken from the case 
of the distribution of mortality in two districts, where the problem is to ascertain 
the probability that the difference of mortality observed allowing for the frequency 
of the age groups could be due to random sampling. 


Let the population sampled be represented by

(α)   $D_1$ | $D_2$ | $D_3$ | ... | $D_s$ | ... | $D_u$ | $\Delta$
      $L_1$ | $L_2$ | $L_3$ | ... | $L_s$ | ... | $L_u$ | $\Lambda$
      $A_1$ | $A_2$ | $A_3$ | ... | $A_s$ | ... | $A_u$ | $P$

where $D_s$ are the dead, $L_s$ the living and $A_s$ the exposed to risk in the $s$th age group,
there being $u$ such groups in the total population $P$, of whom in the given period
$\Delta$ die and $\Lambda$ survive.


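Let the districts be represented by

(β)   $d_1$ | $d_2$ | $d_3$ | ... | $d_s$ | ... | $d_u$ | $\delta$
      $l_1$ | $l_2$ | $l_3$ | ... | $l_s$ | ... | $l_u$ | $\lambda$
      $a_1$ | $a_2$ | $a_3$ | ... | $a_s$ | ... | $a_u$ | $p$

and

(γ)   $d_1'$ | $d_2'$ | $d_3'$ | ... | $d_s'$ | ... | $d_u'$ | $\delta'$
      $l_1'$ | $l_2'$ | $l_3'$ | ... | $l_s'$ | ... | $l_u'$ | $\lambda'$
      $a_1'$ | $a_2'$ | $a_3'$ | ... | $a_s'$ | ... | $a_u'$ | $p'$

respectively. Then the problem is to ascertain the probability that the last two
distributions could both be samples of the same first population.


The general formula has been given by me†; we have to evaluate

$$\chi^2 = \frac{pp'}{p+p'}\,S\left\{\frac{\left(\dfrac{f_s}{p} - \dfrac{f_s'}{p'}\right)^2}{F_s/P}\right\},$$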


* The tables are constructed for $n' - 1$ independent variables; in our case there are $n - q$ such
variables, hence $n' - 1 = n - q$.
† Biometrika, Vol. VIII. p. 252.
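where $f_s$ must take the value of every frequency in the cells of (β), $f_s'$ the corresponding
cell frequency for (γ), and $F_s$ that for (α). We have to seek for $P$ under
the argument $n' = 2u$ in our tables, if we have no restrictions on our variates.
Actually we have such restrictions, for we are going to seek the partial $\chi^2$ when we
suppose the age groups in each sample to be constant. Now

$$d_s + l_s = a_s \quad\text{and}\quad d_s' + l_s' = a_s'.$$

Thus it follows that

$$\left(\frac{d_s}{p} - \frac{d_s'}{p'}\right) + \left(\frac{l_s}{p} - \frac{l_s'}{p'}\right) = \frac{a_s}{p} - \frac{a_s'}{p'} \qquad \text{(xxxvii)}.$$

We have accordingly $u$ equations of condition or $n'$ as argument will be
reduced to $u + 1$. Now we take

$$X_s = \sqrt{\frac{pp'}{p+p'}}\;\frac{\dfrac{d_s}{p} - \dfrac{d_s'}{p'}}{\sqrt{D_s/P}},$$

$$Y_s = \sqrt{\frac{pp'}{p+p'}}\;\frac{\dfrac{l_s}{p} - \dfrac{l_s'}{p'}}{\sqrt{L_s/P}}.$$

Thus $\chi^2 = S_1^u(X_s^2) + S_1^u(Y_s^2)$,

with $u$ conditions of form

$$\sqrt{\frac{D_s}{P}}\,X_s + \sqrt{\frac{L_s}{P}}\,Y_s = \sqrt{\frac{pp'}{p+p'}}\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right),$$

or in the prepared form

$$\sqrt{\frac{D_s}{A_s}}\,X_s + \sqrt{\frac{L_s}{A_s}}\,Y_s = \sqrt{\frac{pp'}{p+p'}}\,\sqrt{\frac{P}{A_s}}\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right).$$

Thus

$$K_s = \sqrt{\frac{pp'}{p+p'}}\,\sqrt{\frac{P}{A_s}}\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right).$$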


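Further, all the cosines like $\cos(ss')$ are zero, for no equations of condition
involve the same variates*. Thus

$$K^2 = K_1^2 + K_2^2 + \ldots + K_u^2 = \frac{pp'}{p+p'}\,S_1^u\left\{\frac{P}{A_s}\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right)^2\right\} \qquad \text{(xxxviii)}.$$

Accordingly

$$\chi_0^2 = \chi^2 - K^2 = \frac{pp'}{p+p'}\left\{S_1^u\,\frac{P}{D_s}\left(\frac{d_s}{p} - \frac{d_s'}{p'}\right)^2 + S_1^u\,\frac{P}{L_s}\left(\frac{l_s}{p} - \frac{l_s'}{p'}\right)^2 - S_1^u\,\frac{P}{A_s}\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right)^2\right\}.$$

We shall now substitute from the relation (xxxvii), getting rid of $\dfrac{l_s}{p} - \dfrac{l_s'}{p'}$.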


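* This will necessarily be true if our equations of condition refer to parallel rows or columns, not if
they refer to certain rows and columns.

We find

$$\chi_0^2 = \frac{pp'}{p+p'}\,S_1^u\left\{\frac{PA_s}{D_sL_s}\left(\frac{d_s}{p} - \frac{d_s'}{p'}\right)^2 - \frac{2P}{L_s}\left(\frac{d_s}{p} - \frac{d_s'}{p'}\right)\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right) + \frac{PD_s}{A_sL_s}\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right)^2\right\}$$

$$= \frac{pp'}{p+p'}\,S_1^u\left\{\frac{PA_s}{D_sL_s}\left[\left(\frac{d_s}{p} - \frac{d_s'}{p'}\right) - \frac{D_s}{A_s}\left(\frac{a_s}{p} - \frac{a_s'}{p'}\right)\right]^2\right\}.$$

Let $\dfrac{D_s}{A_s} = p_s$, $\dfrac{L_s}{A_s} = q_s$, then we may write

$$\chi_0^2 = \frac{pp'}{p+p'}\,S_1^u\left\{\frac{P}{A_sp_sq_s}\left(\frac{d_s - p_sa_s}{p} - \frac{d_s' - p_sa_s'}{p'}\right)^2\right\}.$$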

Now if we know the population sampled, we have only to insert the values
of $P$, $A_s$, and $p_s$, $q_s$ to obtain the value of $\chi_0^2$. But if we do not, and this is usually
the case, then our problem is: Are the two districts samples of the same unknown
general population? For example, we might enquire whether the death distributions
in Bradford and Leeds were, correcting for age, significantly different. It
might be supposed that it would be correct to give $A_s/P$, $p_s$ and $q_s$ the values
found for all England and Wales. But it may be doubted whether this would
be satisfactory. The populations of Bradford and Leeds might fairly be considered
as samples of a general population which is very far indeed from being that of all
England. Accordingly it seems much more reasonable to suppose that they are
samples of a population whose mortality characters will be best represented by
the combination of those of the two districts themselves, or we take

$$p_s = \frac{d_s + d_s'}{a_s + a_s'}, \qquad \frac{A_s}{P} = \frac{a_s + a_s'}{p + p'}.$$


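Substituting the first of these relations we find

$$\chi_0^2 = S_1^u\left\{\frac{\dfrac{a_sa_s'}{pp'}\;a_sa_s'\left(\dfrac{d_s}{a_s} - \dfrac{d_s'}{a_s'}\right)^2}{\dfrac{a_s + a_s'}{p + p'}\;\dfrac{A_s}{P}\;(d_s + d_s')\left(1 - \dfrac{d_s + d_s'}{a_s + a_s'}\right)}\right\} \qquad \text{(xxxix)}.$$

If we substitute the second relation, (xxxix) becomes

$$\chi_0^2 = S_1^u\left\{\frac{\dfrac{a_sa_s'}{pp'}}{\left(\dfrac{a_s + a_s'}{p + p'}\right)^2}\;\frac{a_sa_s'\left(\dfrac{d_s}{a_s} - \dfrac{d_s'}{a_s'}\right)^2}{(d_s + d_s')\left(1 - \dfrac{d_s + d_s'}{a_s + a_s'}\right)}\right\} \qquad \text{(xl)},$$

and this, I take it, is the best measure of significance in the difference of the
death distributions allowing for age groups in the two districts.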


It is clear that the first factor will in many cases differ but little from unity.
For we should anticipate that approximately

$$a_s = \frac{p}{P}\,A_s, \qquad a_s' = \frac{p'}{P}\,A_s.$$

Hence it would follow that

$$\frac{a_sa_s'}{pp'} = \left(\frac{A_s}{P}\right)^2 = \left(\frac{a_s + a_s'}{p + p'}\right)^2,$$

or the first factor is approximately unity. The value of $\chi_0^2$ then becomes

$$Q^2 = S_1^u\left\{\frac{a_sa_s'\left(\dfrac{d_s}{a_s} - \dfrac{d_s'}{a_s'}\right)^2}{(d_s + d_s')\left(1 - \dfrac{d_s + d_s'}{a_s + a_s'}\right)}\right\} \qquad \text{(xli)},$$

a quantity which I have shown elsewhere* has an important relation to the question
of the significant difference of corrected death rates.
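The arithmetic of (xl) and (xli) is simple enough to sketch in Python. The function name `age_corrected_chi2` and the figures in the usage lines are hypothetical, chosen only for illustration; each age group is assumed to contain at least one death and one survivor in the two districts combined, so that no denominator vanishes.

```python
def age_corrected_chi2(a, d, a2, d2):
    """Return (chi2 as in (xl), Q2 as in (xli)) for two districts.

    a, d   : exposed to risk and deaths in each age group of the first district
    a2, d2 : the same for the second district
    """
    p, p2 = sum(a), sum(a2)
    chi2_xl = 0.0
    q2_xli = 0.0
    for a_s, d_s, b_s, e_s in zip(a, d, a2, d2):
        pooled_rate = (d_s + e_s) / (a_s + b_s)
        # Second factor of (xl); summed alone it gives (xli).
        core = a_s * b_s * (d_s / a_s - e_s / b_s) ** 2 / ((d_s + e_s) * (1 - pooled_rate))
        # First factor of (xl); unity when the two age distributions are proportional.
        first_factor = (a_s * b_s / (p * p2)) / ((a_s + b_s) / (p + p2)) ** 2
        chi2_xl += first_factor * core
        q2_xli += core
    return chi2_xl, q2_xli


# Usage with invented figures: two age groups, the second district twice the size
# of the first with the same age distribution, so the first factor is unity.
chi2_xl, q2_xli = age_corrected_chi2([100, 200], [10, 10], [200, 400], [30, 16])
```

It may be noted that each term of (xli) is algebraically the ordinary $\chi^2$ of the fourfold table of deaths and survivors in the two districts for that age group, so that $Q^2$ is the sum of the fourfold $\chi^2$'s over the age groups.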


It is obvious that the present method is of wide generality and I propose to 
illustrate it later by considering whether the general death distributions of various 
national groups allowing for age distribution, occupation distribution and character 
of mortality are or are not significantly different. 


Meanwhile (xxxix) provides us with a formula which enables us readily in the 
case of districts, special diseases or class groups to assert whether mortality 
experiences corrected for age distribution are or are not significantly different. 
As far as I can see (xxxix) and its extensions to occupation groups provide the 
proper means of ascertaining whether the populations at risk in various insurance 
offices or friendly societies are or are not materially different in character. It 
should be a guide to the actuary also as to which classes of the population it 
is most desirable to cater for. 


I have to thank my friends Mr H. E. Soper and Mr A. W. Young for 
suggestions and help at several points. In the following paper numerical 
illustrations of (xl) are provided. 


* See Biometrika, Vol. XI. p. 164.


ON CRITERIA FOR THE EXISTENCE OF 
DIFFERENTIAL DEATHRATES. 


By KARL PEARSON, F.R.S. and J. F. TOCHER, D.Sc.


(1) To determine whether the general deathrates or the deathrates for special 
diseases in two towns or in two classes of the general population are significantly 
different is a problem of very great importance. It is now generally recognised 
that crude deathrates are of little service for this purpose, the age distribution 
in the two districts or in the two classes of the population may be widely different, 
and the general deathrate or the special deathrate, i.e. the deathrate for a special 
disease, usually is a marked function of age*. The deathrate is therefore corrected 
by reduction to a “standard population.” We ask what would be the deathrate 
in the given district or class if its age distribution were constituted in a given 
or “standard” manner. This deathrate is spoken of as the “corrected deathrate.”
Now we usually suppose the deathrate to be subject to “probable error.” In
other words in a population of size $n$, if $p$ be the chance of a person dying in the
year, $q$ the chance of a person surviving, then the standard deviation of the number
of deaths is $\sqrt{npq}$, or the probable error of the corresponding deathrate $m$ is
$\dfrac{0.67449}{\sqrt{n}}\sqrt{pq}$. If there be a second town of deathrate $m'$ and population $n'$, and
chances $p'$ and $q'$, we are tempted to compare without very full consideration
$m' - m$ with $0.67449\sqrt{\dfrac{pq}{n} + \dfrac{p'q'}{n'}}$ to obtain a measure of significant differentiation
probably using a table of the probability integral. This method is for several
reasons unreliable and fallacious. In the first place if it be applied to the crude 
deathrates, p and p’ can hardly be applied to the individual, they are so markedly 
a function of age. In the next place for many diseases, or for many ages in 
a general deathrate, p and p’ will be very small; accordingly there will be no 
approach to Gaussian distribution, but the binomials will approach Poisson’s 
Exponential Limit, in which case the meaning of the standard deviation of a 
difference requires much further consideration and probabilities will not be given 
by a table of the probability integral. 


* It is also well known to be a marked function of class. This is customarily disregarded in a 
comparison of local deathrates. But to test real sanitary efficiency a standardised class population 
may be as important as a standardised age population. 
But the method outlined above has this importance: it suggests that the
deathrate obtained is only a “sample” deathrate, and subject to the variations of
sampling; thus it forces the problem upon us in a very definite form: Can two
populations dying in a known manner during a given period be considered as
samples drawn at random from the same material? Here again the age distribution
difficulty arises, for by hypothesis we admit the age distributions of our two samples
are not the same. Let us fix our attention for a time on a fairly narrow age group,
say that $d_s$ deaths occur in an age group of size $a_s$ in a certain population and that
the chance of death in the age group is $p_s$ and the chance of survival $q_s$ in the
population out of which the sample is supposed to be drawn. Then undoubtedly
the standard deviation of samples would be $\sqrt{a_sp_sq_s}$, but as before the distribution
would hardly be Gaussian. Now let us suppose the standard population, size $A$,
to consist of the age groups $A_1, A_2, \ldots A_s, \ldots$, then the “corrected” deathrate
$M$ will be given by

$$M = S\left(\frac{d_s}{a_s}\,\frac{A_s}{A}\right) \qquad \text{(i)},$$

and if $\delta$ denote a variation due to random sampling, we shall have

$$\delta M = S\left(\frac{\delta d_s}{a_s}\,\frac{A_s}{A}\right).$$


Now speaking generally we do not “draw” our deaths in such a manner that
their total remains constant. A shot so to speak fired at one age group is not
to be supposed if it misses that group to have a chance of hitting a second age
group. There will thus not be any of the usual negative correlation between a
variation in $d_s$ and one in $d_{s'}$. Of course epidemics which in a given period attacked
individuals in certain age groups only might show a positive correlation between
$\delta d_s$ and $\delta d_{s'}$. But in general it will be sufficient to suppose the variations of the
$d_s$'s independent, and measured by the probability of death in each group. Thus
we should have

$$\sigma_M^2 = S\left(\frac{p_s q_s}{a_s}\,\frac{A_s^2}{A^2}\right) \tag{ii}$$

Similarly if there be another population with age groups $a'_s$ and corrected deathrate
$M'$, then

$$M' = S\left(\frac{d'_s}{a'_s}\,\frac{A_s}{A}\right)$$

and

$$\sigma_{M'}^2 = S\left(\frac{p_s q_s}{a'_s}\,\frac{A_s^2}{A^2}\right).$$


We do not write $p'_s$ and $q'_s$ in this population, because we are supposing both
to be samples from one and the same population. Now clearly

$$M' - M = S\left\{\left(\frac{d'_s}{a'_s} - \frac{d_s}{a_s}\right)\frac{A_s}{A}\right\} \tag{iii}$$

and there will be no reason for supposing any correlation between $d'_s$ and $d_s$. Thus
it will follow that the standard deviation of $M' - M$ is $\sqrt{\sigma_M^2 + \sigma_{M'}^2}$. Hence:

$$\sigma^2_{M'-M} = S\left\{p_s q_s\left(\frac{1}{a'_s} + \frac{1}{a_s}\right)\frac{A_s^2}{A^2}\right\} \tag{iv}$$

Now assuming $p_s$ and $q_s$ for the moment to be known, can we learn more from
the relative values of $M' - M$ and $\sigma_{M'-M}$ than we thought possible from the
distribution of the deaths in a single age group owing to the latter's non-Gaussian
form?

It seems probable that we can for the following reasons. Let $z$ be the sum of
$u$ variates $x_1 + x_2 + \ldots + x_u$, these variates following arbitrary laws of frequency
and being in no way correlated together, i.e. $z$ is to be found by taking a random
selection of each of our $u$ variates and adding them together, then from

$$z = x_1 + x_2 + \ldots + x_u$$

we can find the moments of $z$. Obviously we can measure all variates from their
means, and accordingly we find:

$$\begin{aligned}
{}_z\mu_2 &= S({}_x\mu_2) = u \times \bar{\mu}_2,\\
{}_z\mu_3 &= S({}_x\mu_3) = u \times \bar{\mu}_3,\\
{}_z\mu_4 &= S({}_x\mu_4) + 6\,S({}_x\mu_2 \cdot {}_{x'}\mu_2)\\
&= u\,\bar{\mu}_4 + 3u(u-1)\,(\bar{\mu}_2)^2,
\end{aligned}$$

where $\bar{\mu}_2$, $\bar{\mu}_3$ and $\bar{\mu}_4$ denote the mean values of the moment coefficients for the
various $x$-distributions. Accordingly if $B_1$ and $B_2$ be the $\beta$-coefficients for $z$:

$$B_1 = \frac{u^2\,\bar{\mu}_3^2}{u^3\,\bar{\mu}_2^3} = \frac{1}{u}\,\bar{\beta}_1 \tag{v}$$

$$B_2 = \frac{\bar{\beta}_2}{u} + 3\left(1 - \frac{1}{u}\right) \tag{vi}$$

where $\bar{\beta}_1$ and $\bar{\beta}_2$ denote $\beta$-coefficients found from the mean moments. Now in the
case of the Poisson's Exponential Limit to the Binomial we have*

$$\beta_1 = \beta_2 - 3 = \frac{1}{m} \tag{vii}$$

where $m$ is the mean number of deaths in the age groups. Hence, if we dealt with
a fairly large population, where the number of deaths in any group were say 5 to
10 and we made 10 age groups, we could reckon on $B_1$ and $B_2 - 3$ being of the
order .02 to .01, or the distribution would be closely Gaussian. Thus there need
be small hesitation in applying the tables of the probability integral to the
investigation of the relationship of $M' - M$ to $\sigma_{M'-M}$.
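As a rough numerical check of (v) to (vii), the following Python sketch evaluates the $\beta$-coefficients of a sum of $u$ independent Poisson variates. The function name `beta_of_sum` and the simplification that every age group has the same mean number of deaths $m$ are assumptions made for illustration; they are not part of the original argument.

```python
def beta_of_sum(m, u):
    """Return (B1, B2 - 3) for the sum of u independent Poisson variates.

    Assumes (for illustration only) that every group has the same mean m,
    so that the mean beta-coefficients are those of one Poisson
    distribution: beta1 = beta2 - 3 = 1/m, as in (vii).
    """
    beta1_bar = 1.0 / m                            # (vii)
    beta2_bar = 3.0 + 1.0 / m                      # (vii)
    b1 = beta1_bar / u                             # (v)
    b2 = beta2_bar / u + 3.0 * (1.0 - 1.0 / u)     # (vi)
    return b1, b2 - 3.0


for m in (5, 10):
    b1, excess = beta_of_sum(m, u=10)
    print(f"m = {m:2d}: B1 = {b1:.3f}, B2 - 3 = {excess:.3f}")
```

With ten age groups and 5 to 10 deaths per group both coefficients come out between .02 and .01, which is the order of magnitude quoted in the text.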


* Biometrika, Vol. v. p. 353 and Vol. x. p. 39. 


(2) Two points, however, arise in this work. We do not know $p_s$ and $q_s$
nor have we yet selected the standard population, i.e. the values of ratios
like $A_s/A$.


With regard to $p_s$ it may be held by some that the value obtaining in the
general population should be given to it. This might be reasonable if that population
were immensely large as compared with either group under consideration,
but very often the group dealt with is quite considerable as compared to the
remainder. Thus in Scotland for certain administrative purposes we consider
Scotland (Clyde), and Scotland (excluding Clyde). In England for similar purposes
we find nine official districts* selected, so that if we were comparing the Northwestern
district with the London district, it would be curiously difficult to demonstrate
why if these are samples, we take them plus the remainder to be more representative
and fixed than either alone. We doubt very much whether the material we
suppose we are sampling is to be considered as the general population. Rather
we look upon that general population as the indefinitely large group who might
be considered as living and dying under the same environmental conditions, if
they continued indefinitely in force.


Again, is it quite correct to take $p_s$ from the general population of the country
when the problem is to discover whether the two districts are themselves random
samples from some population which may not be the same as that of the general
population of the country? Thus Aberdeen and Inverness might both well be
random samples of a population which is not that of Scotland as a whole†. Hence
as in cases of probable error it will usually be best to calculate $p_s$ from the observed
material itself, i.e. if our two groups be really samples of the same population,
then probably the best thing we can do is to take

$$p_s = \frac{d_s + d'_s}{a_s + a'_s} \quad\text{and}\quad q_s = 1 - \frac{d_s + d'_s}{a_s + a'_s}.$$

If $p_s$ be small, as it usually is, then it will be sufficient to take

$$\sigma^2_{M'-M} = S\left\{\frac{d_s + d'_s}{a_s a'_s}\,\frac{A_s^2}{A^2}\right\}.$$

It remains to consider $A_s/A$. Here again it is not unusual to take $A_s/A$ as
given by the general population. Or if the “corrected deathrates” for a series
of years are being compared, it is not unusual to reduce the age distributions to
that of a certain year, e.g. in the manner of the English Registrar-General for
tuberculosis and cancer deathrates to the population of 1901. For the same
reasons as in the case just discussed we might, perhaps, find it fitting to take
$A_s/A$ as given by the material under discussion, i.e. $= (a_s + a'_s)/(a + a')$. Or,
again, we might reduce one district to the population of the other, i.e. take
$A_s/A = a_s/a$, or $a'_s/a'$. It is of interest to see practically what differences such
divergent reductions make.


* By the Board of Trade for example. 
† Or again lawyers and the clergy may have or may not have significantly different mortalities,
but the mortality of both differs essentially from that of all England.

Another solution of the standard population problem, which seems to us of
some importance, arises from the consideration that we ought to select our standard
population so that the probability that the two districts or classes under consideration
are samples of one and the same population should be a minimum. In other
words we ought to select $A_s/A$ $(= X_s)$ so that

$$Q = S(\Delta_s X_s)\big/\sqrt{S(v_s X_s^2)},$$

where

$$\Delta_s = d'_s/a'_s - d_s/a_s, \qquad v_s = p_s q_s\,(1/a'_s + 1/a_s),$$

is a maximum, subject to the relation $S(X_s) = 1$.

Proceeding by the usual rules for finding a max.-min. we have

$$\frac{\delta Q}{Q} = 0 = S\left\{\left(\frac{\Delta_s}{S(\Delta_s X_s)} - \frac{v_s X_s}{S(v_s X_s^2)}\right)\delta X_s\right\},$$

$$S(\delta X_s) = 0.$$

Therefore if $P$ be an indeterminate multiplier:

$$\frac{\Delta_s}{S(\Delta_s X_s)} - \frac{v_s X_s}{S(v_s X_s^2)} + P = 0.$$

Multiply by $X_s$ and sum all such equations and we find $P = 0$. Hence:

$$X_s = \frac{\Delta_s}{v_s}\,\frac{S(v_s X_s^2)}{S(\Delta_s X_s)}.$$

Sum all such equations and we have

$$1 = S\left(\frac{\Delta_s}{v_s}\right)\frac{S(v_s X_s^2)}{S(\Delta_s X_s)}.$$

Thus we deduce:

$$X_s = \frac{\Delta_s}{v_s}\bigg/ S\left(\frac{\Delta_s}{v_s}\right),$$

whence

$$S(\Delta_s X_s) = S\left(\frac{\Delta_s^2}{v_s}\right)\bigg/ S\left(\frac{\Delta_s}{v_s}\right),$$

and accordingly

$$Q = \sqrt{S\left(\frac{\Delta_s^2}{v_s}\right)}.$$
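The result just obtained can be checked numerically. The sketch below, in Python, compares $Q$ for the weights $X_s \propto \Delta_s/v_s$ with the closed form $\sqrt{S(\Delta_s^2/v_s)}$ and with $Q$ for equal weights. The values of `delta` and `v` are hypothetical figures chosen only for illustration, and the helper name `q_ratio` is likewise an assumption.

```python
import math


def q_ratio(delta, v, x):
    """Q = S(delta * x) / sqrt(S(v * x^2)) for standard-population weights x."""
    num = sum(d * w for d, w in zip(delta, x))
    den = math.sqrt(sum(vi * w * w for vi, w in zip(v, x)))
    return num / den


# Hypothetical rate differences and variance factors for four age groups.
delta = [0.004, 0.001, 0.002, 0.010]
v = [2e-6, 1e-6, 3e-6, 9e-6]

# Weights of maximum Q: X_s proportional to delta_s / v_s, with S(X_s) = 1.
raw = [d / vi for d, vi in zip(delta, v)]
x_max = [r / sum(raw) for r in raw]

q_max = q_ratio(delta, v, x_max)
q_closed = math.sqrt(sum(d * d / vi for d, vi in zip(delta, v)))
q_equal = q_ratio(delta, v, [0.25] * 4)

print(f"Q at X_s ~ delta_s/v_s: {q_max:.4f}")
print(f"sqrt(S(delta^2/v)):     {q_closed:.4f}")
print(f"Q with equal weights:   {q_equal:.4f}")
```

That no other set of weights can exceed this value follows from the Cauchy-Schwarz inequality, $S(\Delta_s X_s) \le \sqrt{S(\Delta_s^2/v_s)}\,\sqrt{S(v_s X_s^2)}$, which is one way of supplying the "little consideration" asked of the reader.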


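A little consideration shows that this is a maximum value of $Q$. The argument
we then use is that if on the standard population which provides a maximum for $Q$,
there be no significance in the deathrate difference, there cannot be any significance
at all in the difference. While on the other hand if on any population whatever
used as standard, we do find a significant difference, such a difference really exists.

Returning to our deathrate symbols we have

$$X_s = \frac{\Delta_s}{v_s}\bigg/ S\left(\frac{\Delta_s}{v_s}\right) = \frac{\dfrac{d'_s/a'_s - d_s/a_s}{p_s q_s\,(1/a'_s + 1/a_s)}}{S\left\{\dfrac{d'_s/a'_s - d_s/a_s}{p_s q_s\,(1/a'_s + 1/a_s)}\right\}} \tag{viii}$$

$$= \frac{\dfrac{a'_s a_s\,(d'_s/a'_s - d_s/a_s)}{(d'_s + d_s)\left(1 - \frac{d'_s + d_s}{a'_s + a_s}\right)}}{S\left\{\dfrac{a'_s a_s\,(d'_s/a'_s - d_s/a_s)}{(d'_s + d_s)\left(1 - \frac{d'_s + d_s}{a'_s + a_s}\right)}\right\}}, \quad\text{or very approximately}\quad \frac{\dfrac{a'_s a_s\,(d'_s/a'_s - d_s/a_s)}{d'_s + d_s}}{S\left\{\dfrac{a'_s a_s\,(d'_s/a'_s - d_s/a_s)}{d'_s + d_s}\right\}}.$$

We note at once that this method would give irrational values for $A_s$ in the
case of any material for which the age deathrate was not invariably greater for one
district or class. On the other hand $Q$, the ratio of the difference of corrected
deathrates to the standard deviation of that difference, is always real and given by

$$Q^2 = S\left\{\frac{a'_s a_s\left(\dfrac{d'_s}{a'_s} - \dfrac{d_s}{a_s}\right)^2}{(d'_s + d_s)\left(1 - \dfrac{d'_s + d_s}{a'_s + a_s}\right)}\right\} \tag{ix}$$

$$= S\left\{\frac{a'_s a_s\left(\dfrac{d'_s}{a'_s} - \dfrac{d_s}{a_s}\right)^2}{d'_s + d_s}\right\} \quad\text{approximately.}$$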

If for every value of $s$, $d'_s/a'_s$ is either greater than $d_s/a_s$, or, on the other hand,
is always less than $d_s/a_s$, then the odds are about 50 to 1 if $Q$ be as great as 2 that
there is significant divergence of the two corrected deathrates. On the other hand
if $Q$ has no significant magnitude, the deathrates will not be significantly different.
Supposing the condition as to the relative magnitude of $d'_s/a'_s$ and $d_s/a_s$ be not
satisfied, then if $Q$ has no significant magnitude for these irrational age classes,
it will certainly have no significant magnitude for any other size of age classes,
and accordingly we conclude that the difference of the deathrates is not significant.
But if $Q$ be of significant magnitude for the irrational age classes, it does not follow
that it will be significant for rational age classes, and further discussion is needful.
Of course any case in which for the bulk of groups $d'_s/a'_s$ is greater than $d_s/a_s$,
but for one or two groups $d_s/a_s$ is the greater, will have $Q = (M' - M)/\sigma_{M'-M}$
lessened by the inclusion of these cases, for the numerator of $Q$ is decreased and
the denominator increased. We should therefore be at liberty to consider only
the groups where the deathrate goes one way, for the age groups are actually
independent, and it is really a fictitious balancing of the corrected deathrates
which arises, when one age class with deaths in excess compensates for another
age class with deaths in defect, and so tends to equalise $M$ and $M'$. This is of
course a grave difficulty which must arise when we deal with any corrected
deathrate at all. Two such deathrates might show no significant difference
although the aged were dying in one population and the young in the other in
excess*. We might work only with the groups for which $d'_s/a'_s - d_s/a_s$ is of
the same sign, and this would indicate, by the same test, differentiation or its
absence in the manner of dying; but of course we should then be dropping the
idea of a “corrected deathrate”; that idea is, however, essentially imperfect and
does not really distinguish effectually between differences in the manner of dying.

(3) We now ask whether it is not feasible to interpret the value of Q in 
Eqn (ix) as a measure of the probability of a differential mortality without regard 
to the theory of “corrected deathrates.” 

As before let $d_s$ be the number of deaths in the age group of size $a_s$ in one
district or class. We suppose this to be a sample of a population of which $p_s$
is the chance of dying in this age group, then if there be $u$ age groups in each
district or class, we have $2u$ deviations, all of which are independent, and given by
$d_s - p_s a_s$ and $d'_s - p_s a'_s$. These deviations have respectively standard deviations
$\sqrt{a_s p_s q_s}$ and $\sqrt{a'_s p_s q_s}$. Accordingly if every deviation be measured in terms of
its standard deviation we shall have a second moment coefficient given by

$$\Sigma^2 = \frac{1}{2u}\left[S\left\{\frac{(d_s - p_s a_s)^2}{a_s p_s q_s}\right\} + S\left\{\frac{(d'_s - p_s a'_s)^2}{a'_s p_s q_s}\right\}\right],$$

and $\Sigma^2$ ought to be unity.

Now in the above expression $p_s$, $q_s$ are the unknown values of the death and
survival chances in the population from which the two districts or classes are
supposed to be sampled, and the best values we probably can take for them are
$p_s = 1 - q_s = (d_s + d'_s)/(a_s + a'_s)$. But inserting these values in the above
expression, we find with the value of $Q$ given in Eqn (ix):

$$\Sigma^2 = \frac{1}{2u}\,Q^2, \quad\text{or}\quad \Sigma = Q \times \frac{1}{\sqrt{2u}}.$$

But the standard deviation of a second moment coefficient is $\sqrt{\mu_4 - \mu_2^2}\big/\sqrt{n}$
and this in the case of a normal distribution ($\mu_4 = 3\mu_2^2$) equals $\sqrt{2/n}\;\mu_2$, so in our
case, since $\mu_2 = 1$ and $n = 2u$, the standard deviation of $\Sigma^2$ equals $1/\sqrt{u}$. Thus
we have to measure $\Sigma^2 - 1$ in terms of $1/\sqrt{u}$, or to ascertain the probability of
a ratio of deviation to standard deviation of magnitude greater than $(\tfrac{1}{2}Q^2 - u)/\sqrt{u}$.
We see again therefore $Q$ arising as a constant which naturally determines the
mortality resemblance or difference of the two districts. But while in the
previous approach to the solution of the problem from the corrected deathrates
we found $Q$ alone sufficed to obtain our criterion, we require in the case of

* Another factor, sometimes overlooked when deathrate after correction is taken as a measure of
local health, is emigration. If the $s$th age class tend to migrate from A to B, it is usually the healthy
who migrate; thus the deathrate of the $s$th class in A will be inflated and that in B reduced. If the
migration be chiefly that of males, as to mining districts and colonies, a spurious correlation between
deathrate and sex ratio may be created.

this second criterion to make use explicitly as well as implicitly of the number
of age classes. We can thus give a physical meaning to $Q$: $\sqrt{2}\,Q$ is the ratio of
the standard deviation of the $2u$ age classes' deathrates (each supposed to be
drawn from the same population and measured in terms of its own standard
deviation) to the standard deviation of this standard deviation, i.e. to
$\dfrac{1}{2\sqrt{u}}$. It by no means follows that this second criterion will give the same
result as may be drawn from the first. But this new aspect of $Q$ frees us from
many of the difficulties essentially associated with “corrected deathrates” and
the indefinite category of a standard population. We must observe, however,
that to evaluate the probability of the occurrence of $Q$ we have again to justify
the assumption that $\Sigma^2$ will follow a normal distribution. We could not justify
this for the distribution of deaths in any single age group, nor even for the
distribution of factors like $(d_s - p_s a_s)^2/(a_s p_s q_s)$ and $(d'_s - p_s a'_s)^2/(a'_s p_s q_s)$ summed
for our two populations on one occasion, but we can do this for the distribution
of $\Sigma^2$ on the basis of a number of random samples*.
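The step from the second moment coefficient of this section to $Q^2/2u$ is stated without the intermediate algebra. A sketch of that algebra for a single age group, using only the pooled value $p_s = (d_s + d'_s)/(a_s + a'_s)$ and writing $\Sigma^2$ for the second moment coefficient (a notational assumption), runs as follows.

```latex
\begin{aligned}
d_s - p_s a_s &= -\frac{a_s a'_s}{a_s + a'_s}\left(\frac{d'_s}{a'_s} - \frac{d_s}{a_s}\right),
\qquad
d'_s - p_s a'_s = +\frac{a_s a'_s}{a_s + a'_s}\left(\frac{d'_s}{a'_s} - \frac{d_s}{a_s}\right), \\[4pt]
\frac{(d_s - p_s a_s)^2}{a_s p_s q_s} + \frac{(d'_s - p_s a'_s)^2}{a'_s p_s q_s}
&= \frac{a_s^2 a_s'^2}{(a_s + a'_s)^2}\left(\frac{d'_s}{a'_s} - \frac{d_s}{a_s}\right)^2
   \frac{1}{p_s q_s}\left(\frac{1}{a_s} + \frac{1}{a'_s}\right) \\[4pt]
&= \frac{a_s a'_s\left(\dfrac{d'_s}{a'_s} - \dfrac{d_s}{a_s}\right)^2}{(a_s + a'_s)\,p_s q_s}
 = \frac{a_s a'_s\left(\dfrac{d'_s}{a'_s} - \dfrac{d_s}{a_s}\right)^2}
        {(d_s + d'_s)\left(1 - \dfrac{d_s + d'_s}{a_s + a'_s}\right)} .
\end{aligned}
```

The last expression is the contribution of the $s$th age group to $Q^2$ in Eqn (ix); summing over the $u$ age groups and dividing by $2u$ gives $\Sigma^2 = Q^2/2u$.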





(4) We can again approach the problem by considering quantities like

$$\frac{d_s - p_s a_s}{\sqrt{a_s p_s q_s}} \quad\text{and}\quad \frac{d'_s - p_s a'_s}{\sqrt{a'_s p_s q_s}}.$$

The mean of these deviations measured each in terms of its own standard
deviation should be zero; and the standard deviation of this mean should be
$\dfrac{1}{\sqrt{2u}}$, since there are $2u$ variates and, each being measured in terms of its own
standard deviation, the standard deviation of the series is as before unity. Thus if

$$m = \frac{1}{2u}\left[S_1^u\left(\frac{d_s - p_s a_s}{\sqrt{a_s p_s q_s}}\right) + S_1^u\left(\frac{d'_s - p_s a'_s}{\sqrt{a'_s p_s q_s}}\right)\right],$$

$m\Big/\left(\dfrac{1}{\sqrt{2u}}\right) = \sqrt{2u} \times m$ may be looked up in the tables of the probability integral,
and the probability of the system, as a result of random sampling from a population
$p_s$, $q_s$, thus again determined. If we use the values of $p_s$ and $q_s$ so often adopted
above, the quantity with which to enter the probability table, i.e. the ratio of
deviation to standard deviation, is expressed by

$$\sqrt{2u} \times m = \frac{1}{\sqrt{2u}}\;S\left\{\frac{\left(\dfrac{d'_s}{a'_s} - \dfrac{d_s}{a_s}\right)\left(\sqrt{a_s} - \sqrt{a'_s}\right)\sqrt{a_s a'_s}}{\sqrt{(d_s + d'_s)(l_s + l'_s)}}\right\},$$

where $l_s = a_s - d_s$ and $l'_s = a'_s - d'_s$ are the survivors in the $s$th age groups.


* We can show in a manner similar to that on p. 161 that the distribution of $\mu_2$'s approaches the
normal when each of the constituent $x^2$'s is drawn from different populations, none of these populations
being in themselves accurately normal.

(5) The previous methods are all more or less inadequate; they test whether
a certain single character of the distribution does or does not present significant
difference in the two populations compared. If we want to test the distributions
as a whole, we must adopt a modification of the method given in Biometrika,
Vol. viii. p. 250 for determining the probability that two systems of frequency are
random samples of the same population. It has been shown in a memoir dealing
with partial contingency* that the proper test is to determine

$$\chi_0^2 = S_1^u\left\{\frac{a_s a'_s\,(p + p')^2}{(a_s + a'_s)^2\,p\,p'}\cdot\frac{a_s a'_s\left(\dfrac{d_s}{a_s} - \dfrac{d'_s}{a'_s}\right)^2}{(d_s + d'_s)\left(1 - \dfrac{d_s + d'_s}{a_s + a'_s}\right)}\right\} \tag{xi}$$


entering the Tables of Goodness of Fit with n’=s+1. Here p and p’ are the 
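numbers in the sampled populations. Now the factor

$$\frac{a_s a'_s\,(p + p')^2}{(a_s + a'_s)^2\,p\,p'}$$

is for most practical purposes unity, as we shall illustrate in the sequel. Hence for
such purposes

$$\chi_0^2 = S\left\{\frac{a_s a'_s\left(\dfrac{d_s}{a_s} - \dfrac{d'_s}{a'_s}\right)^2}{(d_s + d'_s)\left(1 - \dfrac{d_s + d'_s}{a_s + a'_s}\right)}\right\} = Q^2$$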
of our earlier investigations. Thus to obtain the probability of the two districts 
being random samples of the same population all we have to do is to look up Q? 
for n’ = u+ 1, u being the number of age groups, in the Tables of Goodness of 
Fit. It will be seen that Q has a wide meaning, and generally speaking we may 
treat it as a constant, which can be used in a variety of criteria for testing the 
existence of differential deathrates. We propose to illustrate this in the following 
sections, applying in each case the four tests discussed above, namely: 
(a) The probable error of the “corrected” deathrates' difference reduced to
the standard population of maximum difference.
(b) The significance of the means of the $2u$ deathrate classes.
(c) The significance of the squared standard deviations of the $2u$ deathrate
classes.
(d) The general $\chi_0^2$ test of partial contingency with its approximate value $Q^2$.








The following illustrations have been used: 

(i) General Deathrates of Liverpool and Birmingham. 

(ii) Cancer Deathrates of Edinburgh and Dundee.

(iii)–(viii) Cancer Deathrates for all England and Wales divided into four
groups: (a) London, (b) County Boroughs other than London, (c) Urban Districts
other than (a) and (b), (d) Rural Districts.

(ix)–(xiv) Diabetes Deathrates precisely as in the preceding Cancer case.

* Biometrika, Vol. xi. p. 157.

Illustrations. (i) We have for Birmingham and Liverpool the following data
for all Males in 1911:

                   Birmingham                Liverpool
Age Group     Population    Deaths      Population    Deaths

  0–             32,552      2003          45,889      3117
  5–             58,653       161          78,518       326
 15–             48,431       162          62,751       309
 25–             47,212       252          58,216       476
 35–             37,897       382          47,711       632
 45–             25,431       454          32,664       775
 55–             15,384       575          20,198       944
 65–              7,535       511          10,215       904
 75–              1,944       321           2,194       335
 85–                173        58             191        62

Totals          275,212      4879         358,547      7880











Our problem is to discover whether there is actual differentiation between these
two systems of deaths, and if so, what is the measure of it. We work with ten
age groups.

The actual arithmetical work is indicated on the following page. It leads
to $Q^2 = 165.5031$ or $Q = 12.8648$. Hence applying first test $\dfrac{M' - M}{\sigma_{M'-M}} = 12.8648$.
Or: the difference of the two deathrates corrected to the population of maximum
ratio is no less than 12.86 times the standard deviation of the difference. We
conclude that the odds against such a difference arising from random sampling are
enormous*, or the two deathrates are most certainly and markedly different.
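The arithmetic leading to $Q^2$ can be reproduced from the Birmingham and Liverpool figures tabulated above. The Python sketch below evaluates the sum in Eqn (ix) with $p_s$ pooled from both towns; the variable and function names are illustrative assumptions, and Liverpool is taken as the primed population.

```python
import math

# (population, deaths) for males in 1911, age groups 0-, 5-, 15-, ..., 85-.
birmingham = [(32552, 2003), (58653, 161), (48431, 162), (47212, 252),
              (37897, 382), (25431, 454), (15384, 575), (7535, 511),
              (1944, 321), (173, 58)]
liverpool = [(45889, 3117), (78518, 326), (62751, 309), (58216, 476),
             (47711, 632), (32664, 775), (20198, 944), (10215, 904),
             (2194, 335), (191, 62)]


def q_squared_terms(first, second):
    """Contribution of each age group to Q^2 in Eqn (ix)."""
    terms = []
    for (a, d), (a2, d2) in zip(first, second):
        p = (d + d2) / (a + a2)        # pooled chance of death in the group
        diff = d2 / a2 - d / a         # difference of the age deathrates
        terms.append(a * a2 * diff ** 2 / ((d + d2) * (1 - p)))
    return terms


q2 = sum(q_squared_terms(birmingham, liverpool))
print(f"Q^2 = {q2:.4f}, Q = {math.sqrt(q2):.4f}")
```

Run on these data the sum should agree, to rounding, with the values $Q^2 = 165.5031$ and $Q = 12.8648$ quoted in the text.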


There is no difficulty in correcting the deathrates to the population of maximum 
difference as standard, but it is of interest to note what happens, if the standard 
population has other values. 


For example, when we correct to the male population of all England and Wales
for 1901, we find using the formula of p. 161

$$\frac{M' - M}{\sigma_{M'-M}} = 10.0198.$$


If we use the general male population of England and Wales in 1911 we find 


=.= S wee. 


* The chance is approximately $3.3508/10^{38}$. This test is practically valid in this case for the general
deathrate of Liverpool males is greater at all ages except 75 onwards than that of Birmingham and the
age groups above 75 contribute nothing of importance to the value of $Q^2$.

These values of course lead to the same conclusion but show that the ratio
$(M' - M)/\sigma_{M'-M}$ varies from standard population to standard population, and
may be increased more than 20% when we pass to the population of maximum
difference as standard. Such an increase might be of considerable importance in
our estimate of differentiation if the ratio $(M' - M)/\sigma_{M'-M}$ lay between 1.8 and
2.2, say.
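The dependence of the ratio on the standard population can be illustrated with the same Birmingham and Liverpool figures. The sketch below, in Python, evaluates $(M' - M)/\sigma_{M'-M}$ from (iii) and (iv) for two choices of $A_s/A$: the pooled material $(a_s + a'_s)/(a + a')$, and the population of maximum difference. The England and Wales standard populations used in the text are not reproduced here, so the pooled standard is an illustrative substitute; all names are assumptions.

```python
import math

# (population, deaths) for males in 1911, age groups 0-, 5-, 15-, ..., 85-.
birmingham = [(32552, 2003), (58653, 161), (48431, 162), (47212, 252),
              (37897, 382), (25431, 454), (15384, 575), (7535, 511),
              (1944, 321), (173, 58)]
liverpool = [(45889, 3117), (78518, 326), (62751, 309), (58216, 476),
             (47711, 632), (32664, 775), (20198, 944), (10215, 904),
             (2194, 335), (191, 62)]


def ratio_for_standard(first, second, weights):
    """(M' - M) / sigma_{M'-M} from (iii) and (iv) for weights X_s = A_s/A."""
    num, var = 0.0, 0.0
    for (a, d), (a2, d2), x in zip(first, second, weights):
        p = (d + d2) / (a + a2)
        num += (d2 / a2 - d / a) * x
        var += p * (1 - p) * (1 / a2 + 1 / a) * x ** 2
    return num / math.sqrt(var)


# Standard taken from the pooled material: A_s/A = (a_s + a'_s)/(a + a').
total = sum(a for a, _ in birmingham) + sum(a for a, _ in liverpool)
pooled = [(a + a2) / total for (a, _), (a2, _) in zip(birmingham, liverpool)]

# Standard of maximum difference: X_s proportional to Delta_s / v_s.
raw = []
for (a, d), (a2, d2) in zip(birmingham, liverpool):
    p = (d + d2) / (a + a2)
    raw.append((d2 / a2 - d / a) / (p * (1 - p) * (1 / a2 + 1 / a)))
maximal = [r / sum(raw) for r in raw]

r_pooled = ratio_for_standard(birmingham, liverpool, pooled)
r_max = ratio_for_standard(birmingham, liverpool, maximal)
print(f"pooled standard:  {r_pooled:.4f}")
print(f"maximum standard: {r_max:.4f}")
```

Two of the weights in `maximal` come out negative, because the two oldest groups have the lower deathrate in Liverpool; this is a case of "irrational" values for $A_s$, yet the ratio itself remains real.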


Our third test is given by the formula on p. 165:

$$(\tfrac{1}{2}Q^2 - u)/\sqrt{u} = 72.7515/\sqrt{10} = 23.01.$$

The improbability of a deviation as great as or greater than this arising is immense.
Thus we see that the distribution of the squares of the actual deviations is
excessively improbable.
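The third criterion is a one-line calculation once $Q^2$ and the number of age groups are known. The Python sketch below applies it to the value of $Q^2$ found for Liverpool and Birmingham, and to the value 28.4837 quoted later for Edinburgh and Dundee; the function name is an assumption.

```python
import math


def third_criterion(q_squared, u):
    """Ratio (Q^2/2 - u) / sqrt(u).

    This is the deviation of the mean squared standardised deviation
    from unity, measured in units of its standard deviation 1/sqrt(u).
    """
    return (0.5 * q_squared - u) / math.sqrt(u)


liverpool_birmingham = third_criterion(165.5031, 10)
edinburgh_dundee = third_criterion(28.4837, 10)
print(f"Liverpool and Birmingham: {liverpool_birmingham:.2f}")
print(f"Edinburgh and Dundee:     {edinburgh_dundee:.4f}")
```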


Proceeding to our second test we find

$$m = -.16625,$$

and accordingly $m\Big/\left(\dfrac{1}{\sqrt{2u}}\right) = .745$, or $m$ does not differ significantly from zero.
Thus by this test no essential difference would be indicated. This does not show
that it does not exist, but only indicates the inadequacy of the test. In fact since

$$d_s - p_s a_s + d'_s - p_s a'_s = 0,$$

and there is no great difference between $\sqrt{a_s p_s q_s}$ and $\sqrt{a'_s p_s q_s}$, $m$ tends to be zero,
even with considerable differences between $d_s$ and $p_s a_s$ or $d'_s$ and $p_s a'_s$.
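The mean $m$ of the $2u$ standardised deviations can be computed directly from its definition. The Python sketch below does so for the Birmingham and Liverpool data, with $p_s$ pooled from both towns; names are illustrative assumptions.

```python
import math

# (population, deaths) for males in 1911, age groups 0-, 5-, 15-, ..., 85-.
birmingham = [(32552, 2003), (58653, 161), (48431, 162), (47212, 252),
              (37897, 382), (25431, 454), (15384, 575), (7535, 511),
              (1944, 321), (173, 58)]
liverpool = [(45889, 3117), (78518, 326), (62751, 309), (58216, 476),
             (47711, 632), (32664, 775), (20198, 944), (10215, 904),
             (2194, 335), (191, 62)]


def mean_standardised_deviation(first, second):
    """Mean of the 2u deviations, each divided by its own standard deviation."""
    total = 0.0
    for (a, d), (a2, d2) in zip(first, second):
        p = (d + d2) / (a + a2)
        q = 1 - p
        total += (d - p * a) / math.sqrt(a * p * q)
        total += (d2 - p * a2) / math.sqrt(a2 * p * q)
    return total / (2 * len(first))


u = len(birmingham)
m = mean_standardised_deviation(birmingham, liverpool)
print(f"m = {m:.5f}, sqrt(2u) x m = {math.sqrt(2 * u) * m:.3f}")
```

Within each age group the two raw deviations cancel exactly, so the two standardised terms differ only through $\sqrt{a_s}$ and $\sqrt{a'_s}$; this is why $m$ stays small here although $Q$ is very large.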


We have seen that the value of $Q^2$ in the Liverpool and Birmingham case is
165.5031. We will now investigate the value of the factors

$$\left(\frac{a_s a'_s}{p\,p'}\right)\bigg/\left(\frac{a_s + a'_s}{p + p'}\right)^2$$

for the ten age groups. They run

Age        Factor        Age            Factor

 0–5        .98817       45–55         1.00182
 5–15       .99626       55–65          .99895
15–25      1.00071       65–75          .99440
25–35      1.00652       75–85         1.01376
35–45      1.00422       85 and over   1.01510

It will be seen that the factors differ very little from unity and introducing them
into the several terms of $Q^2$ we find

$$\chi_0^2 = 165.4695,$$

or $\chi_0^2$ only differs by 0.02% from $Q^2$. Applying the test for goodness of fit to
$\chi_0^2$ for $n'$ equal eleven groups* we find

$$P = 2.40/10^{30},$$


* See Tables for Statisticians, p. xxxiii.

or the odds against Liverpool and Birmingham general mortality experiences being
samples of the same population are gigantic. This test which takes the distributions
as wholes seems to us absolutely conclusive. The differences in deathrates in
Liverpool and Birmingham are not due to difference of age constitutions, but are
fundamental*.
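The value of $\chi_0^2$ and its probability can be reproduced from the Birmingham and Liverpool data. The Python sketch below multiplies each term of $Q^2$ by the factor discussed above, taking $p$ and $p'$ to be the total male populations of the two towns, and evaluates the goodness-of-fit probability in closed form. Reading $n' = 11$ as ten degrees of freedom in the modern sense is an assumption of this sketch, as are all function names.

```python
import math

# (population, deaths) for males in 1911, age groups 0-, 5-, 15-, ..., 85-.
birmingham = [(32552, 2003), (58653, 161), (48431, 162), (47212, 252),
              (37897, 382), (25431, 454), (15384, 575), (7535, 511),
              (1944, 321), (173, 58)]
liverpool = [(45889, 3117), (78518, 326), (62751, 309), (58216, 476),
             (47711, 632), (32664, 775), (20198, 944), (10215, 904),
             (2194, 335), (191, 62)]


def chi0_squared(first, second):
    """Sum over age groups of (factor x term of Q^2), as in Eqn (xi)."""
    n1 = sum(a for a, _ in first)     # number in the first sampled population
    n2 = sum(a for a, _ in second)    # number in the second sampled population
    total = 0.0
    for (a, d), (a2, d2) in zip(first, second):
        p = (d + d2) / (a + a2)
        q_term = a * a2 * (d2 / a2 - d / a) ** 2 / ((d + d2) * (1 - p))
        factor = (a * a2 / (n1 * n2)) / ((a + a2) / (n1 + n2)) ** 2
        total += factor * q_term
    return total


def chi_square_tail_even(x, dof):
    """P(chi^2 > x) for an even number of degrees of freedom (closed form)."""
    half = x / 2
    term, acc = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j
        acc += term
    return math.exp(-half) * acc


chi0 = chi0_squared(birmingham, liverpool)
prob = chi_square_tail_even(chi0, 10)
print(f"chi_0^2 = {chi0:.4f}, P = {prob:.3e}")
```

The closed form is used only because the number of degrees of freedom is even; a general routine for the incomplete gamma function would serve equally well.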























(ii) Cancer Deathrates in Edinburgh and Dundee, 1891-1900 (Males)†.

                      Edinburgh                      Dundee
Age Group      Population  Deaths, Cancer    Population  Deaths, Cancer

 0–5             149,763          5             89,775          6
 5–15            280,655          8            168,510          2
15–25            274,343         18            142,917          7
25–35            214,063         47             98,953         10
35–45            158,133        117             76,132         47
45–55            115,206        295             59,066        114
55–65             69,954        356             37,337        151
65–75             32,966        266             16,958        109
75–85             10,311         93              4,625         30
85 and over        1,045          8                535          3

Totals         1,306,439       1213            694,808        479

















Proceeding as before we find $Q^2 = 28.4837$. The factors for the age classes
to give $\chi_0^2$ from the $Q^2$ terms for the separate age groups are given below. They
differ more from unity, but some being in excess and some in defect, there is no
substantial difference between $\chi_0^2$ and $Q^2$.

Factor: $\left(\dfrac{a_s a'_s}{p\,p'}\right)\bigg/\left(\dfrac{a_s + a'_s}{p + p'}\right)^2$

Age Group      Factor        Age Group      Factor

 0–5           1.03386       45–55           .98855
 5–15          1.03427       55–65          1.00711
15–25           .99360       65–75           .98966
25–35           .95387       75–85           .94318
35–45           .96788       85 and over     .98812

We have $\chi_0^2 = 28.0393$, or, $\chi_0^2$ is 1.6% less than $Q^2$.

















We now apply the same tests as before: we have

$$(M' - M)/\sigma_{M'-M} = Q = 5.3370.$$

Hence

$$\tfrac{1}{2}(1 + \alpha) = .99999,99527,$$








* This only means that the differentiation cannot be accounted for by age differences, it might well
be accounted for by class or occupation differences.
† The results are sums of ten years' population found on assumption of arithmetical progression.

or, the chance of a deviation so great as this appearing is only $4.73/10^8$, or the
“corrected” cancer deathrates for males are significantly different in Dundee and
Edinburgh.


Using the General Population of Scotland, 1891-1901, as a standard population,
we find

$$(M' - M)/\sigma_{M'-M} = 4.7399.$$

Using the General Population of England and Wales (1901, males) as a standard,
we find

$$(M' - M)/\sigma_{M'-M} = 4.7630.$$

These again both mark significant deviations in the corrected deathrates, but
fail to give the maximum of significance.


Now let us apply the test of distribution of squares of differences. We have

$$(\tfrac{1}{2}Q^2 - u)/\sqrt{u} = 4.24186/\sqrt{10} = 1.3414.$$

Such a deviation would occur about once in ten trials and is not necessarily
significant.
Again applying the test of mean value, we have

$$m = -.1564,$$

and accordingly

$$m\Big/\left(\frac{1}{\sqrt{2u}}\right) = .6994,$$

and this is an insignificant ratio of the deviation to its standard deviation.


We now turn to the $\chi_0^2$ test, where we have $\chi_0^2 = 28.0393$. The Tables of
Goodness of Fit provide for $n'$ = eleven:

$$P = .00178,$$

or the odds are nearly 500 to 1 against such a deviation on random sampling.
We conclude that there is a significant difference between cancer mortality in
Dundee and Edinburgh. It is noteworthy that the corrected deathrates criterion
which when analysed seems so very unsatisfactory gives here as in the case of the
Liverpool and Birmingham General Deathrates far greater significance to the
observed differences of mortality.
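The Edinburgh and Dundee figures can be put through the same arithmetic. The Python sketch below recomputes $Q^2$ from the table of illustration (ii) and evaluates the goodness-of-fit probability for the quoted $\chi_0^2 = 28.0393$. Treating $n' = 11$ as ten degrees of freedom is an assumption of the sketch, and the names are illustrative.

```python
import math

# (population, cancer deaths) for males, 1891-1900, age groups 0-5, ..., 85+.
edinburgh = [(149763, 5), (280655, 8), (274343, 18), (214063, 47),
             (158133, 117), (115206, 295), (69954, 356), (32966, 266),
             (10311, 93), (1045, 8)]
dundee = [(89775, 6), (168510, 2), (142917, 7), (98953, 10),
          (76132, 47), (59066, 114), (37337, 151), (16958, 109),
          (4625, 30), (535, 3)]


def q_squared(first, second):
    """Q^2 of Eqn (ix), with p_s pooled from the two samples."""
    total = 0.0
    for (a, d), (a2, d2) in zip(first, second):
        p = (d + d2) / (a + a2)
        total += a * a2 * (d2 / a2 - d / a) ** 2 / ((d + d2) * (1 - p))
    return total


def chi_square_tail_even(x, dof):
    """P(chi^2 > x) for an even number of degrees of freedom (closed form)."""
    half = x / 2
    term, acc = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j
        acc += term
    return math.exp(-half) * acc


q2 = q_squared(edinburgh, dundee)
prob = chi_square_tail_even(28.0393, 10)
print(f"Q^2 = {q2:.4f}, Q = {math.sqrt(q2):.4f}")
print(f"P = {prob:.5f}, odds against about {(1 - prob) / prob:.0f} to 1")
```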


We have not considered it worth while to investigate for this case the test
of the significance of the mean of the $2u$ deathrates. It is we believe inadequate
and further is laborious to calculate. It is, we hold, sufficient and more enlightening
to calculate $\chi_0^2$, and, what is almost deduced in the same process, the quantity $Q^2$.


There is a further point which may be illustrated on the Liverpool-Birmingham
and Dundee-Edinburgh data. We have supposed in the course of our work that
$(d_s + d'_s)/(a_s + a'_s)$ is an extremely reasonable value to give to $p_s$. If we assume
that the distributions of quantities like $d_s$ are given with sufficient accuracy by
the normal curve then the probability of the whole result observed is given by

$$\Pi = \mathop{\text{Product}}_{s=1}^{u}\left\{\frac{1}{\sqrt{2\pi}\sqrt{a_s p_s q_s}}\,e^{-\frac{1}{2}\frac{(d_s - p_s a_s)^2}{a_s p_s q_s}}\cdot\frac{1}{\sqrt{2\pi}\sqrt{a'_s p_s q_s}}\,e^{-\frac{1}{2}\frac{(d'_s - p_s a'_s)^2}{a'_s p_s q_s}}\right\}V \tag{xii}$$

where $V$ is the continued product of the differentials of the $d_s$'s and the $d'_s$'s,
and we require to make this a maximum for the variation of all quantities like $p_s$.
In other words we have $u$ equations found by differentiating the above expression
for $p_1, p_2, \ldots p_s, \ldots p_u$ (of course $q_s = 1 - p_s$) to determine the best values of these
quantities. Taking the logarithmic differential of the above product with regard
to $p_s$ and equating it to zero, we find since $\delta q_s = -\delta p_s$:

$$0 = -\frac{1}{p_s} + \frac{1}{q_s} + \frac{d_s - p_s a_s + d'_s - p_s a'_s}{p_s q_s} + \frac{1}{2}\left\{\frac{(d_s - p_s a_s)^2}{a_s p_s^2 q_s^2} + \frac{(d'_s - p_s a'_s)^2}{a'_s p_s^2 q_s^2}\right\}(q_s - p_s).$$

Hence:

$$p_s = \frac{d_s + d'_s}{a_s + a'_s} + \frac{q_s - p_s}{2\,(a_s + a'_s)\,q_s p_s}\left\{\frac{(d_s - p_s a_s)^2}{a_s} + \frac{(d'_s - p_s a'_s)^2}{a'_s} - 2 p_s q_s\right\}.$$

Now assuming to a first approximation $p_s = \dfrac{d_s + d'_s}{a_s + a'_s}$ we find to a second
approximation $\tilde{p}_s$:

$$\tilde{p}_s = \frac{d_s + d'_s}{a_s + a'_s} - \frac{1}{a_s + a'_s}\left(1 - 2\,\frac{d_s + d'_s}{a_s + a'_s}\right)\left\{1 - \frac{1}{2}\,\frac{a_s a'_s\left(\dfrac{d'_s}{a'_s} - \dfrac{d_s}{a_s}\right)^2}{(d_s + d'_s)\left(1 - \dfrac{d_s + d'_s}{a_s + a'_s}\right)}\right\}.$$

Calling as before

$$Q_s^2 = \frac{a_s a'_s\left(\dfrac{d'_s}{a'_s} - \dfrac{d_s}{a_s}\right)^2}{(d_s + d'_s)\left(1 - \dfrac{d_s + d'_s}{a_s + a'_s}\right)},$$

we have

$$\tilde{p}_s = \frac{d_s + d'_s}{a_s + a'_s} - \frac{\left(1 - 2\,\dfrac{d_s + d'_s}{a_s + a'_s}\right)\left(1 - \tfrac{1}{2}Q_s^2\right)}{a_s + a'_s}.$$

Hence we may write $\tilde{p}_s = \dfrac{d_s + d'_s - \epsilon_s}{a_s + a'_s}$;
this amounts to altering the deaths by

$$\epsilon_s = \left(1 - 2\,\frac{d_s + d'_s}{a_s + a'_s}\right)\left(1 - \tfrac{1}{2}Q_s^2\right),$$

which is usually a very small number, or we may write

$$\tilde{p}_s = \frac{d_s + d'_s}{a_s + a'_s} \times f,$$
where the factor

$$f = 1 - \left(1 - \tfrac{1}{2}Q_s^2\right)\left(\frac{1}{d_s + d'_s} - \frac{2}{a_s + a'_s}\right)$$

will only differ slightly from unity.

The following table gives the values of $p_s$, $\tilde{p}_s$ for the various age groups in
the case of (a) Liverpool and Birmingham for General Deathrates, (b) Edinburgh
and Dundee, Cancer Deathrates, (c) Rural and Urban Districts, England and
Wales, Cancer Deathrates.


Illustrations of Approximate Deathrates for unknown sampled Populations.

                (a) Liverpool and          (b) Edinburgh and         (c) Rural and Urban
                Birmingham, General        Dundee, Cancer            Districts, England and
                Deathrates                 Deathrates                Wales, Cancer Deathrates
Age Group
                p_s        p̃_s            p_s        p̃_s           p_s        p̃_s

 0–5            .065,272   .065,332        .000,046   .000,045        ..         ..
 5–15           .003,550   .003,611        .000,022   .000,021       .000,025   .000,025
15–25           .004,236   .004,299        .000,060   .000,058       .000,033   .000,033
25–35           .006,905   .007,039        .000,182   .000,187       .000,116   .000,116
35–45           .011,845   .011,937        .000,700   .000,698       .000,404   .000,404
45–55           .021,155   .021,335        .002,347   .002,360       .001,500   .001,500
55–65           .042,690   .042,905        .004,725   .004,726       .004,256   .004,256
65–75           .079,718   .080,269        .007,511   .007,532       .007,787   .007,788
75–85           .158,531   .158,498        .008,235   .008,252       .009,653   .009,646
85 and over     .329,670   .328,756        .006,962   .006,405       .008,856   .008,847

In (c) the entry against 5–15 refers to the combined age group 0–15.








It will be seen that the corrective factor contains the inverses of the total
number $(d_s + d'_s)$ of deaths and the total numbers of individuals $(a_s + a'_s)$ in
the combined age groups $s$ and $s'$. Hence for big districts as in (c) the
corrective factor is of small importance even for special diseases. For the
general deathrates in two large towns as in (a) the difference between $p_s$ and
$\tilde{p}_s$ is as a rule less than 1%. Even in special diseases in towns of moderate
size, it is only where the total number of deaths in the combined age groups
$s$ and $s'$ is very small that any substantial divergence between $p_s$ and $\tilde{p}_s$
arises, e.g. in (b) for the child or extreme old age groups. Thus the value of $Q^2$
is hardly likely to be modified practically, if we replace $\tilde{p}_s$ by $p_s$. Accordingly
$\dfrac{d_s + d'_s}{a_s + a'_s}$, besides being easy to calculate, is a reasonable approximation to the
better value $\tilde{p}_s$. Of course $\tilde{p}_s$ itself is only an approximation* to the “best value”
and this “best value” also depends on the accuracy of replacing the binomial
by a normal curve. Thus it is by no means certain that, if we obtained a true
best value for the deathrate in the unknown sampled population, it would be
markedly nearer to $\tilde{p}_s$ than to $p_s$. We content ourselves by remarking that
neither for practical nor theoretical reasons does there seem likelihood of a great
gain resulting from taking any other value than $(d_s + d'_s)/(a_s + a'_s)$ for $p_s$.
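The size of the correction can be checked on individual age groups. The Python sketch below computes the first approximation $p_s = (d_s + d'_s)/(a_s + a'_s)$ and the second approximation obtained by altering the deaths by $(1 - 2p_s)(1 - \tfrac{1}{2}Q_s^2)$, for the youngest and oldest Birmingham and Liverpool groups. The function name is an assumption, and the sign convention (the alteration is subtracted from the deaths) is the one that reproduces the tabulated values.

```python
def second_approximation(a, d, a2, d2):
    """Return (p_s, second approximation) for one age group.

    a, d   : population and deaths in the first district
    a2, d2 : population and deaths in the second district
    """
    p = (d + d2) / (a + a2)
    q_s2 = a * a2 * (d2 / a2 - d / a) ** 2 / ((d + d2) * (1 - p))
    eps = (1 - 2 * p) * (1 - 0.5 * q_s2)   # alteration of the deaths
    return p, (d + d2 - eps) / (a + a2)


# Birmingham and Liverpool males, 1911: age groups 0-5 and 85 and over.
p_young, pt_young = second_approximation(32552, 2003, 45889, 3117)
p_old, pt_old = second_approximation(173, 58, 191, 62)
print(f"0-5:         p = {p_young:.6f}, second approximation = {pt_young:.6f}")
print(f"85 and over: p = {p_old:.6f}, second approximation = {pt_old:.6f}")
```

In the youngest group $Q_s^2$ is large and the correction raises the estimate; in the oldest group $Q_s^2$ is small and the correction lowers it, as in column (a) of the table.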


* We should have to solve a cubic for each age group to find the accurate “best value,” on the
above hypothesis of normality. Doubts may also be raised as to the legitimacy of the theory which
makes $\Pi$ in (xii) a maximum. They are discussed in another paper.

(iii)–(viii) Cancer Deathrates for all England and Wales.

The object of the present illustrations is to ascertain whether there are
significant differences in the cancer deathrates associated with urban and rural
conditions. We divide the data into four groups: (a) London, (b) County
Boroughs other than London, (c) Urban Districts other than County Boroughs,
(d) Rural Districts. We compare pair and pair these four groups in order to
ascertain the degree of their significant differences.


The following data are taken from the Registrar-General’s 76th Annual Report 
and are for the year 1913*: 


Populations and Cancer Deaths in Age Groups, 1913.

Populations

                                 (b) County     (c) Urban      (d) Rural
Age Group         (a) London     Boroughs       Districts      Districts

 0–15               648,361      1,796,847      1,976,078      1,242,634
15–25               383,601      1,006,039      1,126,179        716,323
25–35               362,967        938,541      1,017,991        580,305
35–45               293,427        766,070        836,306        496,185
45–55               213,635        536,949        586,962        395,443
55–65               131,423        331,209        369,840        276,814
65–75                68,124        166,717        200,982        180,029
75–85                18,830         43,125         60,407         65,453
85 and over           2,239          4,493          7,361          9,125

Cancer Deaths

 0–15                    25             31             47             34
15–25                    24             39             34             27
25–35                    51            128            121             65
35–45                   130            416            353            185
45–55                   472           1075            971            503
55–65                   723           1700           1675           1077
65–75                   655           1312           1607           1360
75–85                   209            450            576            639
85 and over              29             29             73             73

Populations       2,122,607      5,589,990      6,182,106      3,962,311
Total deaths          2,318          5,180          5,457          3,963
Crude deathrates
per 100,000          109.21          92.67          88.27         100.02





How far are these differences in the crude deathrates really significant, when 
allowance is made for age groups? Above all: what is the numerical measure 
of this significance in the six cases? The following table gives the deathrates 
per 100,000 at each age in the four groups: 


* See pp. 4, 217, 234, 253, and 271. 
Male Cancer Statistics, England and Wales, 1913.
(Deathrates per 100,000.)

Age Group        (a) London    (b) County    (c) Urban     (d) Rural
                               Boroughs      Districts     Districts
0-15                   3.86          1.73          2.38          2.74
15-25                  6.26          3.88          3.02          3.77
25-35                 14.05         13.64         11.89         11.20
35-45                 44.30         54.30         42.21         37.28
45-55                220.94        200.21        165.43        127.20
55-65                550.13        513.27        452.90        389.07
65-75                961.48        786.96        799.57        755.43
75-85               1109.93       1043.48        953.53        976.27
85 and over         1295.22        645.45        991.71        800.00

Corrected
deathrates*          111.48        101.01         91.97         82.24
Crude
deathrates           109.21         92.67         88.27        100.02
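The age-group and crude deathrates tabled above follow directly from the populations and deaths. The following is a minimal sketch in Python, not part of the original paper; the names `POPULATION`, `CANCER_DEATHS`, `age_specific_rates` and `crude_rate` are illustrative only.

```python
# Populations and cancer deaths by age group (males, 1913), transcribed from
# the tables in the text. All names here are illustrative.
AGE_GROUPS = ["0-15", "15-25", "25-35", "35-45", "45-55",
              "55-65", "65-75", "75-85", "85 and over"]

POPULATION = {
    "London": [648361, 383601, 362967, 293427, 213635, 131423, 68124, 18830, 2239],
    "County Boroughs": [1796847, 1006039, 938541, 766070, 536949, 331209, 166717, 43125, 4493],
    "Urban Districts": [1976078, 1126179, 1017991, 836306, 586962, 369840, 200982, 60407, 7361],
    "Rural Districts": [1242634, 716323, 580305, 496185, 395443, 276814, 180029, 65453, 9125],
}

CANCER_DEATHS = {
    "London": [25, 24, 51, 130, 472, 723, 655, 209, 29],
    "County Boroughs": [31, 39, 128, 416, 1075, 1700, 1312, 450, 29],
    "Urban Districts": [47, 34, 121, 353, 971, 1675, 1607, 576, 73],
    "Rural Districts": [34, 27, 65, 185, 503, 1077, 1360, 639, 73],
}


def age_specific_rates(district):
    """Deathrate per 100,000 in each age group of one district class."""
    return [100_000 * d / a
            for d, a in zip(CANCER_DEATHS[district], POPULATION[district])]


def crude_rate(district):
    """Total deaths per 100,000 of the total population of one district class."""
    return 100_000 * sum(CANCER_DEATHS[district]) / sum(POPULATION[district])


if __name__ == "__main__":
    for district in POPULATION:
        print(district, round(crude_rate(district), 2))
```

The crude rate pools all ages, so it reflects the age constitution of each district class as well as its age-group rates; this is why the text goes on to compare the age-group distributions themselves.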








In the first place we take the “corrected deathrates” reduced to the male
population of 1913. We have:

Pair                                     $M'-M$   $\sigma_{M'-M}$   $(M'-M)/\sigma_{M'-M}$    $Q$
London and County Boroughs                10.47       2.666              3.93               6.96
London and Urban Districts                19.51       2.522              7.74               8.95
London and Rural Districts                29.24       2.536             11.53              13.09
County Boroughs and Urban Districts        9.04       1.848              4.89               7.39
County Boroughs and Rural Districts       18.77       1.953              9.61              12.25
Urban Districts and Rural Districts        9.73       1.829              5.32               6.76














To begin with it will be seen that there is a very considerable difference in the
values of $(M' - M)/\sigma_{M'-M}$ and $Q$; in other words, the reduction to the general population of
males gives nothing like the same intensity of significance to the differences
between the means as the reduction to the standard populations of maximum
difference. We are unable to determine what is the standard population of real
maximum difference in any case†, and this very fact seems to discredit the use
of the deathrates corrected to an arbitrary standard population as a means of
adequately testing differences in mortality. For, although in this case all the
differences of the deathrates corrected to the general male population are
significant, in the next case, that of diabetes, they may not be, while the
differences of the deathrates corrected to the standard population of maximum
difference may be of significance, as in the case of diabetes they practically are.

* The “corrected deathrates” are here reduced to the assumed male population of 1913 of all England
and Wales. They differ therefore somewhat from the Registrar-General’s “corrected deathrates” (110.0,
98.9, 90.2, 80.6 respectively), which are deduced from the general population of England and Wales, 1901.

† $Q$ may be deduced from an unreal population of maximum difference, i.e. one with some age classes
negative.
If we consider the order of the significance deduced from the two standard
populations we have:

  $Q$     Pair                                     $(M'-M)/\sigma_{M'-M}$
 13.09    London and Rural Districts                   11.53
 12.25    County Boroughs and Rural Districts           9.61
  8.95    London and Urban Districts                    7.74
  7.39    County Boroughs and Urban Districts           4.89
  6.96    London and County Boroughs                    3.93
  6.76    Urban Districts and Rural Districts           5.32











The Q order shows that London and the County Boroughs are markedly 
differentiated from the Rural and Urban Districts—in the higher degree from the 
former. Thus the city districts tend most to a high cancer rate. London is 
differentiated from the County Boroughs, and the Urban from the Rural Districts, 
but to a less extent than in the previous cases. The $(M' - M)/\sigma_{M'-M}$ test confirms
this order except in the case of Urban and Rural Districts which are now more 
highly differentiated than County Boroughs and Urban Districts. Thus we note 
that a particular standard population may not only influence the significances of 
the differentiated deathrates, but also the order of these significances. This result 
will be confirmed in the case of diabetes. 


The following are the values of $Q^2$ and $\chi_0^2$:

Pair                                       $Q^2$      $\chi_0^2$
London and County Boroughs                48.4567      49.2414
London and Urban Districts                80.0869      80.6986
London and Rural Districts               171.3498     164.2686
County Boroughs and Urban Districts       54.5901      54.3279
County Boroughs and Rural Districts      149.9172     151.0230
Urban Districts and Rural Districts       45.6922      46.5109





It will be seen that the only considerable difference between $Q^2$ and $\chi_0^2$ arises
in the case of the London and Rural Districts pair, and more than half this
difference is due to the very heavy deathrate in London from cancer of persons
between 65 and 75.


Before we discuss the inferences to be drawn from these results, we will place
on record the ratio of $m$ to its standard deviation $1/\sqrt{u}$. The easiest way to
calculate this ratio is from the components which express $m$ in terms of
$a_s$, $a'_s$, $d_s$, $d'_s$ and $Q_s$ [the displayed expressions for these components and
for $m$ are illegible in this copy], where $Q_s$ must be given its proper sign.
The advantage of this method is that the value of $Q_s$ has usually been tabled
as a stage to the finding of $Q^2 = S_1^u(Q_s^2)$. In the present case we deduce:


Arithmetic Value of $m \div 1/\sqrt{u}$

London and County Boroughs                1.00
London and Urban Districts                1.94
London and Rural Districts                1.94
County Boroughs and Urban Districts       0.06
County Boroughs and Rural Districts       0.49
Urban Districts and Rural Districts       0.26


We can now calculate the probability that each pair are random samples of 
the same population by aid of our four tests. We place the pairs of districts in 
order of their improbability as random samples of the same population, taking
$\chi_0^2$ as our standard test.


Probability $P$ of the Districts being Samples of the same Population.

The four tests are: (i) the $\chi_0^2$ Test (Goodness of Fit Test); (ii) the Test from
difference of Corrected Deathrates, $(M'-M)/\sigma_{M'-M} = Q$; (iii) the Test from
Distribution of Squares of Deviations, $(\tfrac{1}{2}Q^2 - u)/\sqrt{u}$; (iv) the Test from
Mean of Deviations, $m \div 1/\sqrt{u}$.

Paired Districts compared
for Cancer Mortality                      (i)             (ii)             (iii)           (iv)
London and Rural Districts            0.9625/10^30    1.8031/10^39    1.9628/10^144      .3121
County Boroughs and Rural Districts   0.5414/10^27    0.7929/10^34    1.9385/10^107      .0262
London and Urban Districts            1.1740/10^13    1.7497/10^19    0.2555/10^24       .0262
County Boroughs and Urban Districts   1.6355/10^8     0.6606/10^13    0.5133/10^9        .4761
London and County Boroughs            1.4996/10^7     1.6586/10^12    0.1927/10^6        .1587
Urban Districts and Rural Districts   0.4214/10^6     0.6327/10^11    0.1962/10^5        .3974





The probability* in the first of these tests is deduced from the Goodness of 
Fit Tables; in the remaining three tests from the Tables of the Probability 
Integral. It will be seen that the first three tests give absolutely the same order 
of significance for the six pairs of differences. The fourth test is irregular and 


* The very high improbabilities given are only rough approximations, sufficient, however, for our
present purposes. Since we enter the Goodness of Fit Tables with $n$ = ten (i.e. $u+1$), we must use the
first value of $P$ in Equation (xxix), Tables for Statisticians, p. xxxi. The integral J may then be
calculated by the first term of the Schlömilch formula (ibid. p. xxxiii), as this integral will only affect
the fourth figure in the decimals. Table IV (ibid. p. 11) has been used to approximate to the extreme
tails of the probability integral. Very useful work could be done by extending this Table between 5
and 50 to the first decimal place in the argument.
confirms the view already expressed that but little is to be gained from its use.
The fact is that each test measures a different feature of the difference of the
distributions and the worst test will be that which measures the least important
characteristic. There is little doubt that the deviation from zero of the mean of
all the deviations measured in terms of their S.D.’s is this characteristic. The
second test, which measures the significance of the “corrected” deathrates for the standard
population of maximum significance, gives results which are most closely in accord
with the $\chi_0^2$, but it suffers from two rather serious defects: (i) it is conceivable
that $M$ might be very close to $M'$ and yet the actual distribution of deaths very
different, (ii) the standard population which gives the maximum significance to
the difference of the corrected deathrates may, as we have indicated (p. 164), be
an impossible one. Hence the values of the significance may be very considerably
exaggerated. It has been our experience, that when we have taken other standard
populations, we have found the $Q^2$ considerably less than for the population of
maximum significance, but not always most markedly less.


On the whole we think a test which considers the general distribution of 
deviations more likely to show definite results, than one which considers only 
a mean, and from this standpoint we hold that the $\chi_0^2$ and $(\tfrac{1}{2}Q^2 - u)/\sqrt{u}$ are the
better criteria. The main assumptions on which these tests are based are for the
latter: that the distribution of squared deviations (each measured in terms of
its own S.D.) will, even if each deviation be selected from a non-normal frequency,
give a second moment distribution which follows the normal law; and for the 
former: that the Gaussian curve accurately enough describes the frequency given 
by a binomial. This assumption would be more closely fulfilled by a general 
than by a special disease deathrate, but is probably more valid than the previous 
assumption. Hence we believe that while the three first tests have all a certain 
value, the first and third are to be preferred and the first is best of all. 


Judged by the first, second and third tests we conclude that significant 
differences can be definitely said to exist between all these cancer mortalities, 
Urban and Rural Districts showing the least but still a very weighty significance; 
that London and the County Boroughs other than London have significantly 
different cancer mortality, while both London and the County Boroughs differ 
conspicuously from the Urban and Rural Districts. 


It is clear that the degree of significance is closely associated with some variate 
which increases with difference of position in the scale (a) London, (b) County
Boroughs, (c) Urban Districts, (d) Rural Districts, i.e. with some factor which 
increases with the city character. The increase during the past fifty years in the 
cancer deathrate has been associated by some with improved diagnosis. Is the 
variate correlated with the above order that of better diagnosis? It may, perhaps, 
be doubted whether the general practitioner is much more competent in cancer 
diagnosis in London now-a-days than in the Rural Districts. But an examination 
of the terms of $Q^2$ or $\chi_0^2$ shows that nearly half of the significant difference arises
from the terms $Q_s^2$ and $\chi_s^2$ corresponding to the 45 to 55 group and nearly a third
corresponds to the 55 to 65 group. In fact in the case of London and Rural
Districts about three-fourths and in the case of County Boroughs and Rural
Districts about four-fifths of the value of $Q^2$ are contributed by the age groups
45 to 65. This is not the period in which the total number of deaths from cancer 
is a maximum; it may therefore be the period in which the less skilled medical 
man is less likely to diagnose cancer. It would be equally valid, however, to 
assert that cancer finds more susceptibility in town than in country dwellers and 
that this is particularly the case with men from 45 to 65 years of age. Further 
suggestions are, of course, emigration of cancerous persons to the towns* and
the presence of certain occupations with high cancer deathrates in the towns. 
In both these cases we must find explanation for the particularly marked contribu- 
tions for the age groups 45-65, unless we are content with the view that these are 
the age groups where the cancer deathrates are fairly high, and the population of 
the group with which the result is weighted fairly large. 

It would be in every way desirable if we could, applying ‘‘Occam’s razor,” 
attribute to one source the rising cancer deathrate and the significant differences 
between cancer mortality in cities and in rural districts. Both may be due to 
differential diagnostic power or to varying accuracy of certification, but we gain 
little by merely throwing out suggestions, and omitting to demonstrate them. 


(ix)—(xiv) Diabetes Deathrates for all England and Wales. 


As a last illustration of the present method we take the deaths from Diabetes 
for the year 1913 from the Registrar-General’s Report† in the same fundamental
groupings; of course the populations at risk in age groups will remain the same. 


Diabetes Deaths in Age Groups, 1913.

Age Group               (a) London    (b) County    (c) Urban     (d) Rural
                                      Boroughs      Districts     Districts
0-15                          8            22            22            26
15-25                        12            35            46            38
25-35                        20            60            69            38
35-45                        21            63            66            44
45-55                        49            86           102            51
55-65                        59           158           196           106
65-75                        47           178           181           137
75-85                        16            47            57            50
85 and over                   0             5             6             1

Total Deaths                232           654           745           491
Crude deathrates         10.930        11.700        12.051        12.392
Corrected deathrates     11.099        12.649        12.433        10.948
(Both per 100,000)








* It must be remembered of course that institutional deaths are since 1911 distributed to their locus 
of origin, and therefore immigration for operation no longer tends to swell the London or County 
Boroughs cancer deathrates. It would be otherwise with immigration for permanent residence in the 
initial stages of the disease. But there is no evidence at present to show that cancerous persons of 
ages 45 to 65 do move into the larger towns.

† See pp. 217, 235, 253 and 271.
The “Corrected” Deathrates in the table on p. 180 are based on the Age Groups
of the total population for England and Wales in 1913. It will be seen that they 
modify entirely the order of the Crude Deathrates, which order appeared at first 
sight to be comparable with the cancer results, as showing a difference between the 
city and rural or urban areas—in this case indeed to the advantage of the former. 
Proceeding as in the case of cancer to compare significance of corrected deathrates 
when reduced to the general population of males (1913) as standard and when 
reduced to the standard population of maximum difference we find: 





Pair                                     $M'-M$    $\sigma_{M'-M}$   $(M'-M)/\sigma_{M'-M}$    $Q$
London and County Boroughs               -1.550       .8898              1.74               4.16
London and Urban Districts               -1.334       .8794              1.52               3.16
London and Rural Districts               +0.151       .8829              0.17               3.95
County Boroughs and Urban Districts      +0.216       .6727              0.32               2.34
County Boroughs and Rural Districts      +1.701       .6998              2.43               5.49
Urban Districts and Rural Districts      +1.485       .6790              2.19               5.03
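The “corrected” deathrates and the differences $M' - M$ can be retraced by direct standardization. The sketch below is an illustration in Python, not the authors' computation. It assumes that the standard population is the sum of the four district classes in each age group; the paper says only that the rates are based on the age groups of the total population in 1913, though this assumption reproduces the tabled diabetes figures. The names `STANDARD` and `corrected_rate` are hypothetical.

```python
# Populations (males, 1913) and diabetes deaths by age group, from the text.
POPULATION = {
    "London": [648361, 383601, 362967, 293427, 213635, 131423, 68124, 18830, 2239],
    "County Boroughs": [1796847, 1006039, 938541, 766070, 536949, 331209, 166717, 43125, 4493],
    "Urban Districts": [1976078, 1126179, 1017991, 836306, 586962, 369840, 200982, 60407, 7361],
    "Rural Districts": [1242634, 716323, 580305, 496185, 395443, 276814, 180029, 65453, 9125],
}

DIABETES_DEATHS = {
    "London": [8, 12, 20, 21, 49, 59, 47, 16, 0],
    "County Boroughs": [22, 35, 60, 63, 86, 158, 178, 47, 5],
    "Urban Districts": [22, 46, 69, 66, 102, 196, 181, 57, 6],
    "Rural Districts": [26, 38, 38, 44, 51, 106, 137, 50, 1],
}

# ASSUMPTION: the standard population is the sum of the four district classes.
STANDARD = [sum(column) for column in zip(*POPULATION.values())]


def corrected_rate(district):
    """Directly standardized deathrate per 100,000.

    Each age-group rate d/a of the district is applied to the standard
    population s of that age group; the expected deaths are then divided
    by the total standard population.
    """
    expected_deaths = sum(
        d / a * s
        for d, a, s in zip(DIABETES_DEATHS[district], POPULATION[district], STANDARD)
    )
    return 100_000 * expected_deaths / sum(STANDARD)


if __name__ == "__main__":
    london = corrected_rate("London")
    boroughs = corrected_rate("County Boroughs")
    print(round(london, 3), round(boroughs, 3), round(london - boroughs, 3))
```

A different choice of `STANDARD` changes every corrected rate and hence every difference, which is the dependence on the standard population that the text criticises.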





Now with the single doubtful result for County Boroughs and Rural Districts
none of the values of the ratio $(M' - M)/\sigma_{M'-M}$ can be definitely considered as
rendering $M' - M$ significant. On the other hand all the differences of the corrected
deathrates reduced to the standard population of maximum difference must, with 
one possible exception, i.e. County Boroughs and Urban Districts, be considered 
as markedly significant. This illustration is of great interest. For it is quite 
easy to select a standard population where for real age classes there is a significant 
difference between the “corrected” deathrates of London and the County Boroughs, 
but reduced to the general population (males) for 1913 there is no such difference. 
We see therefore that the reduction to an arbitrary standard population may be 
absolutely misleading as a means of testing whether two class deathrates are 
differentiated. Again the order of pairs for significance in the case of diabetes 
is for Q: 














  $Q$     Pair                                     $(M'-M)/\sigma_{M'-M}$
  5.49    County Boroughs and Rural Districts           2.43
  5.03    Urban Districts and Rural Districts           2.19
  4.16    London and County Boroughs                    1.74
  3.95    London and Rural Districts                    0.17
  3.16    London and Urban Districts                    1.52
  2.34    County Boroughs and Urban Districts           0.32











It will be seen that the significance of London and Rural Districts is considerably 
displaced by the general population as standard. Or, we conclude, as in the case 
of cancer, that relative significance as well as absolute is determined by the special 
standard population selected and we cannot accept the current view that the
“standard population” to which the deathrates are corrected is a matter merely
of convenience.


It is clear that there is nothing in the “corrected” deathrates to indicate in any
group a substantial difference from the rate 11.883 per 100,000 for all England
and Wales*. It will accordingly be of some interest in this case to test whether
this result of non-significance which flows from the deathrates corrected to the
general population is confirmed when more accurate methods are applied to test
the distribution of deviations as well as mean rates. The following are the values
of $Q^2$ and $\chi_0^2$:





Pair                                       $Q^2$      $\chi_0^2$
London and County Boroughs                17.3188      18.0702
London and Urban Districts                10.0126       9.9831
London and Rural Districts                15.5939      15.4501
County Boroughs and Urban Districts        5.4693       5.4236
County Boroughs and Rural Districts       30.1470      30.4870
Urban Districts and Rural Districts       25.2912      25.8263





It will be seen as in previous cases that there is no substantial difference between
$Q^2$ and $\chi_0^2$.

The following table gives the deathrates from Diabetes per 100,000 exposed to 
risk in each age group of the four categories: 


Male Diabetes Statistics, England and Wales, 1913.
(Deathrates per 100,000.)

Age Group        (a) London    (b) County    (c) Urban     (d) Rural
                               Boroughs      Districts     Districts
0-15                   1.23          1.22          1.11          2.09
15-25                  3.13          3.48          4.08          5.30
25-35                  5.51          6.39          6.78          6.55
35-45                  7.16          8.22          7.89          8.87
45-55                 22.94         16.02         17.38         12.90
55-65                 44.89         47.70         53.00         38.29
65-75                 68.99        106.77         90.06         76.10
75-85                 84.97        108.99         94.36         76.39
85 and over            0           111.28         81.51         10.96





The problem before us is again whether the above distributions of deathrates 
as a whole are or are not significantly different, and not whether the differences 
in one of their statistical constants (i.e. the “corrected” deathrate) are or are 
not significant. We shall apply only three tests, i.e. (i) the $\chi_0^2$ test or Goodness
of Fit Test, (ii) that of the difference of corrected deathrates for the standard
population of maximum difference, and (iii) the Test from Distribution of Squares
of Deviations.

* The deviations from the General Rate are (a) -.784, (b) +.766, (c) +.550 and (d) -.935, all
of the order of the probable errors of the differences.


Proceeding as in the case of Cancer we have the following table: 


Probability $P$ of the two Groups being samples of the same Population.

The three tests are: (i) the $\chi_0^2$ Test (Goodness of Fit Test); (ii) the Test from
difference of Corrected Deathrates, $(M'-M)/\sigma_{M'-M} = Q$; (iii) the Test from
Distribution of Squares of Deviations, $(\tfrac{1}{2}Q^2 - u)/\sqrt{u}$.

Paired Districts compared
for Diabetes Mortality                    (i)          (ii)          (iii)
Rural Districts and County Boroughs      .0004      0.20/10^7       .0215
Rural Districts and Urban Districts      .0022      0.25/10^6       .1121
London and County Boroughs               .0344      0.00002         .4548
Rural Districts and London               .0801      0.00004         .3442
London and Urban Districts               .3520      0.0007          .0916
Urban Districts and County Boroughs      .7943      0.0097          .0184








As before the order of significance, as we might expect, of the first two tests
is the same*, but it is no longer as in the case of cancer the same as in the third
test. The third test shows only significance between the members of the two pairs
Rural Districts and County Boroughs and Urban Districts and County Boroughs,
and neither difference is at all emphatic, while the last is out of accord with the $\chi_0^2$
test. The $\chi_0^2$ test shows distinct significance between Rural Districts on the one
side and County Boroughs and Urban Districts on the other. The Rural Districts
are not very significantly differentiated from London. Probably London and the
County Boroughs have a different diabetes mortality. An examination of the table
on p. 182 shows that the chief difference between the Rural Districts and the Urban
and County Boroughs mortality is the much lessened deathrate after 50 years of
age; while the difference between the Rural Districts and London lies in the
much greater deathrate from diabetes under 45 in the Rural Districts. Such
differences might be so balanced that there existed no significant difference in the
corrected deathrates. No one could judge by the corrected deathrates of 10.948
and 11.099 with a probable error of .60 that the Diabetes mortalities of Rural
Districts and of London are as significantly differentiated as they are.


We see that the second test enormously exaggerates the significances determined 
by the first test and this is precisely what we might anticipate. The second test 


* The order of significance in the third test will not now be the same as in the second, although
it depends only on $Q^2$, because $(\tfrac{1}{2}Q^2 - u)$ can be positive or negative. It was the same in the case of cancer
because $Q^2$ was always greater than $2u$. Positive or negative values of $(\tfrac{1}{2}Q^2 - u)$ only signify that the
mean of the squares of deviations exceeds or falls short of the theoretical value respectively.











184 On Criteria for the Existence of Differential Deathrates 


is the difference between the corrected deathrates, corrected to a standard popula- 
tion which gives the maximum difference between those rates. But this standard 
population is, if the individual age group deathrates are not all greater in one 
district than the other, an algebraical fiction and not a real standard population. 
But with the single possible exception of London and County Boroughs there is 
no approach in the case of diabetes mortality to greater deathrates in all the 
age groups of one district class. This was far more nearly the case in the cancer 
mortality. Hence the second test was more reasonable in that case. 


We have retained this maximum difference of corrected deathrates to the end 
of our illustrations, because we think it serves to indicate the danger of any 
argument as to significant differences in mortality based on “corrected deathrates.” 
In the case of diabetes, our district classes give no such significant differences. But 
our $\chi_0^2$ test shows that these differences actually exist, as indeed might be suspected,
although their numerical valency could not be adequately tested by a mere
examination of the age group deathrate table on p. 182. The reason for this 
failure of the corrected deathrate difference lies largely in the fact already insisted 
on that the significance of the difference in the corrected deathrates depends largely 
upon the standard population selected—a point often overlooked. What then is 
to be the standard population selected? Clearly it should be such as (i) to make 
the corrected deathrate difference a maximum and (ii) at the same time be a real 
population, i.e. not one with negative age classes. At present we do not see how 
to reach the maximum of a certain function of variates, subject to the condition 
that the variates are to take positive values only. The corrected deathrates for 
diabetes in certain districts show no significant differentiation when reduced to the 
general population of England and Wales as standard. They show a marked 
differentiation when reduced to the standard populations of maximum difference. 
It is true that these populations are merely algebraic fictions, but how far should 
we approach this marked differentiation, if we could discover the real maximum 
difference population? We cannot say; and in view of this uncertainty, it seems 
to us needful to drop for the present any criterion of mortality differentiation 
depending on the so-called corrected deathrates. 


We can only conclude that the proper test for differentiated mortality is the 
$\chi_0^2$ test used for the first time in the present paper. For this test does not depend
on the measure of the divergence between two means—“ corrected” it may be,— 
but on the general difference between two frequency distributions as wholes 
and this appears to us the essential feature of any true measure of differential 
mortality. 


We have to thank our colleagues Mr A. W. Young, Mr I. Horwitz and 
Mr George Rae for much assistance in a piece of arithmetical work more arduous 
than may appear on the face of this paper. 








ON CERTAIN PROBABLE ERRORS AND CORRELATION 
COEFFICIENTS OF MULTIPLE FREQUENCY DISTRI- 
BUTIONS WITH SKEW REGRESSION. 


By L. ISSERLIS, D.Sc. 


(1) In the systematic investigation of the statistical constants of multiple 
correlation and of their probable errors, it is important to have to hand the probable 
errors and the mutual correlations of the more fundamental constants—the means, 
the standard deviations and the correlation coefficients. For the case in which 
the frequency distribution follows the normal law this need is supplied in the 
memoir by Pearson and Filon entitled “On the Probable Errors of Frequency 
Constants and on the Influence of Random Selection on Variation and Correlation*.” 

As regards the more general case in which the regression is skew, the probable 
error of a correlation coefficient was first given by Sheppard (Phil. Trans. Vol. 192, 
A, p. 128). 

The probable error of a mean and the correlation between deviations in the 
value of the mean and that of a standard deviation, or of a correlation coefficient, 
and the correlation between two standard deviations, are given by Pearson 
(Biometrika, Vol. IX. 1913, pp. 1-10).

For reference we give here the results for the case of normal distributions 
obtained by Pearson and Filon in the memoir referred to above : 


$$\Sigma_{\sigma_1} = \frac{\sigma_1}{\sqrt{2N}}, \tag{1}$$

$$\Sigma_{r_{12}} = \frac{1 - r_{12}^2}{\sqrt{N}}, \tag{2}$$

$$R_{\sigma_1\sigma_2} = r_{12}^2, \tag{3}$$

$$R_{\sigma_1 r_{12}} = \frac{r_{12}}{\sqrt{2}}, \tag{4}$$

$$R_{\sigma_1 r_{23}} = \frac{r_{12}(r_{13} - r_{12}r_{23}) + r_{13}(r_{12} - r_{13}r_{23})}{\sqrt{2}\,(1 - r_{23}^2)}, \tag{5}$$

$$R_{r_{12} r_{13}} = r_{23} - \frac{r_{12}r_{13}\,(1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23})}{2\,(1 - r_{12}^2)(1 - r_{13}^2)}, \tag{6}$$

$$R_{r_{12} r_{34}} = \frac{\begin{aligned}&(r_{13} - r_{12}r_{23})(r_{24} - r_{23}r_{34}) + (r_{14} - r_{34}r_{13})(r_{23} - r_{12}r_{13})\\ &\quad + (r_{13} - r_{14}r_{34})(r_{24} - r_{12}r_{14}) + (r_{14} - r_{12}r_{24})(r_{23} - r_{24}r_{34})\end{aligned}}{2\,(1 - r_{12}^2)(1 - r_{34}^2)}. \tag{7}$$


* Phil. Trans. Vol. 191, A (1898), pp. 229-311. 
† Phil. Trans. Vol. 191, A, Equations (xv)-(xviii), (xxxvi), (xxxvii) and (xl).

In the present paper the corresponding results are obtained for the case of skew
regression. The method employed is different and by supposing the regression
to be linear and the distribution to be normal, a confirmation is obtained of the
above results which in Pearson and Filon’s memoir depend on very complicated
analysis.


(2) We may begin by discussing the correlation that exists between deviations
from their means in the case of two correlation coefficients $r_{xy}$ and $r_{zt}$.
We have

$$r_{xy} = \frac{p_{xy}}{\sqrt{p_{x^2}\,p_{y^2}}}, \tag{8}$$

$$r_{zt} = \frac{p_{zt}}{\sqrt{p_{z^2}\,p_{t^2}}}, \tag{9}$$

where $p_{x^l y^m z^n t^k}$ is employed to denote the mixed moment of orders $l$, $m$, $n$, $k$ in
the variables, taken about the means, so that

$$\frac{dr_{xy}}{r_{xy}} = \frac{dp_{xy}}{p_{xy}} - \frac{1}{2}\frac{dp_{x^2}}{p_{x^2}} - \frac{1}{2}\frac{dp_{y^2}}{p_{y^2}}, \tag{10}$$

and

$$\frac{dr_{zt}}{r_{zt}} = \frac{dp_{zt}}{p_{zt}} - \frac{1}{2}\frac{dp_{z^2}}{p_{z^2}} - \frac{1}{2}\frac{dp_{t^2}}{p_{t^2}}. \tag{11}$$

It is clear that we shall require the correlations between any one of $p_{xy}$, $p_{x^2}$, $p_{y^2}$
and any one of $p_{zt}$, $p_{z^2}$ and $p_{t^2}$. It will suffice to find the correlation between
$p_{xy}$ and $p_{zt}$.

Now

$$N p_{xy} = \underset{xy}{SS}\{n_{xy}\,(x - \bar{x})(y - \bar{y})\}, \tag{12}$$

$$N\,dp_{xy} = \underset{xy}{SS}\{dn_{xy}\,(x - \bar{x})(y - \bar{y})\} + \underset{xy}{SS}\{-n_{xy}\,(y - \bar{y})\,d\bar{x}\} + \underset{xy}{SS}\{-n_{xy}\,(x - \bar{x})\,d\bar{y}\}, \tag{13}$$

or

$$N\,dp_{xy} = \underset{xy}{SS}\{dn_{xy}\,XY\}, \tag{14}$$

if we denote the total population by $N$, $x - \bar{x}$ by $X$, $y - \bar{y}$ by $Y$ and remember
that

$$\underset{xy}{SS}\{(x - \bar{x})\,n_{xy}\} = \underset{xy}{SS}\{(y - \bar{y})\,n_{xy}\} = 0.$$

Similarly

$$N\,dp_{zt} = \underset{zt}{SS}\{dn_{zt}\,ZT\}. \tag{15}$$
The mean value of $N^2\,dp_{xy}\,dp_{zt}$ in many samples is the mean value of

$$\underset{xy}{SS}\{dn_{xy}\,XY\} \times \underset{zt}{SS}\{dn_{zt}\,ZT\}, \tag{16}$$

which is written out as a fourfold summation over the cells, the term in the square of a
cell deviation being taken separately [the expanded right-hand member of (16) is illegible
in this copy].

But clearly the right-hand member of (16) reduces to

$$N p_{xyzt} - N p_{xy}\,N p_{zt}/N,$$

hence

$$\text{Mean value of } dp_{xy}\,dp_{zt} = (p_{xyzt} - p_{xy}\,p_{zt})/N. \tag{17}$$

Putting $t = z$ we deduce from (17)

$$\text{Mean value of } dp_{xy}\,dp_{z^2} = (p_{xyz^2} - p_{xy}\,p_{z^2})/N, \tag{18}$$

and putting $y = x$ in this result,

$$\text{Mean value of } dp_{x^2}\,dp_{z^2} = (p_{x^2z^2} - p_{x^2}\,p_{z^2})/N. \tag{19}$$


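If we multiply (10) and (11), sum for all samples and divide by the number
of samples we deduce

$$\begin{aligned}
\frac{N\,\sigma_{r_{xy}}\sigma_{r_{zt}}R_{r_{xy}r_{zt}}}{r_{xy}\,r_{zt}}
&= \frac{p_{xyzt} - p_{xy}p_{zt}}{p_{xy}p_{zt}}
 - \frac{1}{2}\frac{p_{xyz^2} - p_{xy}p_{z^2}}{p_{xy}p_{z^2}}
 - \frac{1}{2}\frac{p_{xyt^2} - p_{xy}p_{t^2}}{p_{xy}p_{t^2}} \\
&\quad - \frac{1}{2}\frac{p_{x^2zt} - p_{x^2}p_{zt}}{p_{x^2}p_{zt}}
 - \frac{1}{2}\frac{p_{y^2zt} - p_{y^2}p_{zt}}{p_{y^2}p_{zt}}
 + \frac{1}{4}\frac{p_{x^2z^2} - p_{x^2}p_{z^2}}{p_{x^2}p_{z^2}} \\
&\quad + \frac{1}{4}\frac{p_{x^2t^2} - p_{x^2}p_{t^2}}{p_{x^2}p_{t^2}}
 + \frac{1}{4}\frac{p_{y^2z^2} - p_{y^2}p_{z^2}}{p_{y^2}p_{z^2}}
 + \frac{1}{4}\frac{p_{y^2t^2} - p_{y^2}p_{t^2}}{p_{y^2}p_{t^2}}.
\end{aligned} \tag{20}$$

This result, like Sheppard’s formula for $\sigma_r^2$, is much simpler when expressed
in reduced moments. Let us write

$$q_{x^l y^m z^n t^k} = \frac{p_{x^l y^m z^n t^k}}{\sigma_x^l\,\sigma_y^m\,\sigma_z^n\,\sigma_t^k},$$

so that $q_{x^2}$ is unity and $q_{xy} = r_{xy}$. The numerical term in (20) is

$$-1 + 4(\tfrac{1}{2}) + 4(-\tfrac{1}{4})$$

or zero, hence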


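$$\frac{N\,\sigma_{r_{xy}}\sigma_{r_{zt}}R_{r_{xy}r_{zt}}}{r_{xy}\,r_{zt}}
= \frac{q_{xyzt}}{r_{xy}r_{zt}}
 - \frac{1}{2}\left(\frac{q_{xyz^2} + q_{xyt^2}}{r_{xy}} + \frac{q_{x^2zt} + q_{y^2zt}}{r_{zt}}\right)
 + \frac{1}{4}\left(q_{x^2z^2} + q_{x^2t^2} + q_{y^2z^2} + q_{y^2t^2}\right). \tag{21}$$

In the same notation Sheppard’s formula becomes

$$\sigma_r^2 = \frac{r^2}{N}\left\{\frac{q_{x^2y^2}}{r^2} + \frac{1}{4}\left(\beta_2 + \beta'_2 + 2q_{x^2y^2}\right) - \frac{q_{x^3y} + q_{xy^3}}{r}\right\}, \tag{22}$$

where $r$ stands for $r_{xy}$, $\beta_2 = q_{x^4}$ and $\beta'_2 = q_{y^4}$.

To find the correlation between $r_{xy}$ and $r_{xz}$ we have only to replace $t$ by $x$ in
(21), thus

$$\frac{N\,\sigma_{r_{xy}}\sigma_{r_{xz}}R_{r_{xy}r_{xz}}}{r_{xy}\,r_{xz}}
= \frac{q_{x^2yz}}{r_{xy}r_{xz}}
 - \frac{1}{2}\left(\frac{q_{xyz^2} + q_{x^3y}}{r_{xy}} + \frac{q_{x^3z} + q_{xy^2z}}{r_{xz}}\right)
 + \frac{1}{4}\left(q_{x^2z^2} + q_{x^4} + q_{x^2y^2} + q_{y^2z^2}\right). \tag{23}$$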


(3) These correlation coefficients will simplify if the regression be linear and
simplify to a considerable extent if at the same time the distribution be normal.
For with linear regression

$$N p_{x^2yz} = \underset{xyz}{SSS}(n_{xyz}\,x^2yz)\dagger = \underset{xy}{SS}(n_{xy}\,x^2y\,\bar{z}_{xy}),$$

where $\bar{z}_{xy}$ is the mean value of $z$ for given values of $x$ and $y$.


* For the denominator of left-hand side, cf. Biometrika, Vol. IX. p. 4.
† The origin being taken at the mean.
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But from the usual regression equation 
_ of Taz — Tay yz\ Fz Tyz — TezTey\ 2 
ay— 2 rg — a SF y 2 Sat. 
fay 1 Ce | ee 
so that for linear regression 
Tan — TayTyz\ Fz Tyz — TezTey\ ©. 
Pan = ( |... — Pury + - ~- = Pyhy® ovcccceceees (24). 
ey = Oy 


2 
Cy Tey 


Q) 


Further
$$p_{x^3y} = \frac1N\,S_x S_y\,(n_{xy}\,x^3y),$$
or
$$p_{x^3y} = p_{x^4}\,r_{xy}\,\frac{\sigma_y}{\sigma_x} \tag{25}$$
while
$$q_{x^2y^2} = 1 + r_{xy}^2\sqrt{(\beta_2-1)(\beta'_2-1)} \quad\text{approximately*} \tag{26}$$
so that (23) can be evaluated approximately by the use of simple moment
coefficients and correlation coefficients only. If in addition the distribution be
normal, we know that

$$p_{x^3y} = 3p_{xy}p_{x^2} \quad\text{and}\quad p_{x^2y^2} = (1+2r_{xy}^2)\,p_{x^2}p_{y^2},$$
so that for normal distributions

$$\frac{q_{x^2yz}}{r_{xy}r_{xz}} = \frac{r_{xz}-r_{xy}r_{yz}}{1-r_{xy}^2}\cdot\frac{3}{r_{xz}} + \frac{r_{yz}-r_{xz}r_{xy}}{1-r_{xy}^2}\cdot\frac{1+2r_{xy}^2}{r_{xy}r_{xz}},$$
or
$$\frac{q_{x^2yz}}{r_{xy}r_{xz}} = \frac{r_{yz}}{r_{xy}r_{xz}} + 2 \tag{27}$$
Similarly
$$\frac{q_{xy^2z}}{r_{xz}} = 1 + \frac{2r_{xy}r_{yz}}{r_{xz}} \tag{28}$$
and
$$\frac{q_{xyz^2}}{r_{xy}} = 1 + \frac{2r_{xz}r_{yz}}{r_{xy}} \tag{29}$$
Lz 


Substituting these values in (23) we obtain, after some reduction and using
$\sigma_{r_{xy}} = (1 - r_{xy}^2)/\sqrt{N}$,
$$(1-r_{xy}^2)(1-r_{xz}^2)\,R_{r_{xy}r_{xz}} = r_{yz}(1-r_{xy}^2)(1-r_{xz}^2) - \tfrac12\,r_{xy}r_{xz}\,(1 - r_{xy}^2 - r_{xz}^2 - r_{yz}^2 + 2r_{xy}r_{yz}r_{zx}) \tag{30}$$

agreeing with (6) the value obtained by Pearson and Filon for normal distributions.
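
The reduction can be checked numerically. The following Python sketch evaluates the right-hand side of (23) with the normal-theory reduced moments and compares it with the closed form displayed above. The function names and the trial correlations are illustrative assumptions and form no part of the original memoir; the pairing rule in `q4` is used here only because, for repeated indices, it reproduces the normal values $q_{x^4} = 3$, $q_{x^3y} = 3r_{xy}$, $q_{x^2y^2} = 1 + 2r_{xy}^2$ and the expressions (27) to (29).

```python
def q4(R, i, j, k, l):
    """Fourth-order reduced moment of a normal system: sum over the three pairings."""
    return R[i][j] * R[k][l] + R[i][k] * R[j][l] + R[i][l] * R[j][k]


def n_sigma_sigma_r(R):
    """N * sigma_rxy * sigma_rxz * R, i.e. (23) multiplied through by r_xy * r_xz."""
    x, y, z = 0, 1, 2
    a, b = R[x][y], R[x][z]  # a = r_xy, b = r_xz
    return (q4(R, x, x, y, z)
            - 0.5 * b * (q4(R, x, y, z, z) + q4(R, x, x, x, y))
            - 0.5 * a * (q4(R, x, x, x, z) + q4(R, x, y, y, z))
            + 0.25 * a * b * (q4(R, x, x, z, z) + q4(R, x, x, x, x)
                              + q4(R, y, y, z, z) + q4(R, x, x, y, y)))


def closed_form(R):
    """Right-hand side of the closed form quoted for normal distributions."""
    a, b, c = R[0][1], R[0][2], R[1][2]
    return (c * (1 - a * a) * (1 - b * b)
            - 0.5 * a * b * (1 - a * a - b * b - c * c + 2 * a * b * c))


# Trial correlation matrix for (x, y, z); the values are arbitrary.
R = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]

from_moments = n_sigma_sigma_r(R)
from_formula = closed_form(R)
# Correlation between r_xy and r_xz, using sigma_r = (1 - r^2) / sqrt(N).
correlation = from_formula / ((1 - R[0][1] ** 2) * (1 - R[0][2] ** 2))
```

Both routes give the same number for any admissible correlations, which is all the sketch is meant to show.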


As regards the more general case dealing with the correlation between deviations
of $r_{xy}$ and those of $r_{zt}$ given by equation (21), we have when the regression is linear

$$p_{xyzt} = \frac1N\,S_x S_y S_z S_t\,\{n_{xyzt}\,(xyzt)\} = \frac1N\,S_x S_y S_z\,\{n_{xyz}\,(xyz\,\bar t_{xyz})\},$$

where $\bar t_{xyz}$ is the mean value of t for given values of x, y and z, so that as is well
known

$$\frac{\bar t_{xyz}}{\sigma_t} = -\frac{\Delta_{xt}}{\Delta_{tt}}\,\frac{x}{\sigma_x} - \frac{\Delta_{yt}}{\Delta_{tt}}\,\frac{y}{\sigma_y} - \frac{\Delta_{zt}}{\Delta_{tt}}\,\frac{z}{\sigma_z} \tag{31}$$
where
$$\Delta = \begin{vmatrix} 1 & r_{xy} & r_{xz} & r_{xt} \\ r_{xy} & 1 & r_{yz} & r_{yt} \\ r_{xz} & r_{yz} & 1 & r_{zt} \\ r_{xt} & r_{yt} & r_{zt} & 1 \end{vmatrix} \tag{32}$$
and $\Delta_{st}$ is the minor corresponding to $r_{st}$. Thus
$$q_{xyzt} = -\left(\Delta_{xt}\,q_{x^2yz} + \Delta_{yt}\,q_{xy^2z} + \Delta_{zt}\,q_{xyz^2}\right)/\Delta_{tt} \tag{33}$$
so that $R_{r_{xy}r_{zt}}$ can be evaluated approximately in the case of linear regression
without employing any mixed moments beyond the simple product moment
occurring in a correlation coefficient.


For normal distributions we may use (27), (28) and (29), giving
$$q_{xyzt} = -\left[\Delta_{xt}(2r_{xy}r_{xz} + r_{yz}) + \Delta_{yt}(2r_{xy}r_{yz} + r_{xz}) + \Delta_{zt}(2r_{xz}r_{yz} + r_{xy})\right]/\Delta_{tt} \tag{34}$$


* Biometrika, Vol. IX. p. 4.


By well-known properties of first minors of a determinant we have from (32)

$$\Delta_{xt} + r_{xy}\Delta_{yt} + r_{xz}\Delta_{zt} + r_{xt}\Delta_{tt} = 0 \tag{35}$$
$$r_{xy}\Delta_{xt} + \Delta_{yt} + r_{yz}\Delta_{zt} + r_{yt}\Delta_{tt} = 0 \tag{36}$$
$$r_{xz}\Delta_{xt} + r_{yz}\Delta_{yt} + \Delta_{zt} + r_{zt}\Delta_{tt} = 0 \tag{37}$$

Multiply these equations by $r_{yz}$, $r_{xz}$, $r_{xy}$ respectively and add,

$$(r_{yz} + 2r_{xz}r_{xy})\Delta_{xt} + (r_{xz} + 2r_{yz}r_{xy})\Delta_{yt} + (r_{xy} + 2r_{yz}r_{xz})\Delta_{zt} + (r_{yz}r_{xt} + r_{xz}r_{yt} + r_{xy}r_{zt})\Delta_{tt} = 0 \tag{38}$$
Combining this result with (33) we see that for normal distributions
$$q_{xyzt} = r_{xy}r_{zt} + r_{xz}r_{yt} + r_{xt}r_{yz} \tag{39}$$

an interesting result* likely to prove useful in other applications and probably
capable of generalisation. Particular cases of (39) are obtained by putting t = x,
so that $q_{x^2yz} = r_{yz} + 2r_{xy}r_{xz}$, which is (27), and t = x, z = y, giving $q_{x^2y^2} = 1 + 2r_{xy}^2$,
which is well known.
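
As an illustration of (39) and of its particular cases, the short Python sketch below evaluates the three-pairing sum for a trial correlation matrix. The function name `q_xyzt`, the index convention (0 = x, 1 = y, 2 = z, 3 = t) and the numerical correlations are assumptions made for the example only.

```python
def q_xyzt(R, i, j, k, l):
    """Reduced fourth moment for a normal system, as in (39):
    r_ij * r_kl + r_ik * r_jl + r_il * r_jk."""
    return R[i][j] * R[k][l] + R[i][k] * R[j][l] + R[i][l] * R[j][k]


# Trial correlation matrix, rows and columns ordered x, y, z, t.
R = [
    [1.00, 0.30, 0.20, 0.10],
    [0.30, 1.00, 0.25, 0.15],
    [0.20, 0.25, 1.00, 0.40],
    [0.10, 0.15, 0.40, 1.00],
]

general = q_xyzt(R, 0, 1, 2, 3)      # four distinct variates
case_27 = q_xyzt(R, 0, 0, 1, 2)      # t = x: r_yz + 2 r_xy r_xz
case_square = q_xyzt(R, 0, 0, 1, 1)  # t = x, z = y: 1 + 2 r_xy^2
```

Repeating an index in the call is the same as putting two of the variates equal in (39), so the particular cases need no separate code.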


If we now substitute these values in equation (21) we find

$$N\sigma_{r_{xy}}\sigma_{r_{zt}}R_{r_{xy}r_{zt}} = \tfrac12\,\big[\,2r_{yz}r_{xt} + 2r_{xz}r_{yt} - 2r_{xz}r_{yz}r_{zt} - 2r_{xt}r_{yt}r_{zt} - 2r_{xz}r_{xt}r_{xy} - 2r_{yz}r_{yt}r_{xy} + r_{xy}r_{zt}\,(r_{xz}^2 + r_{xt}^2 + r_{yz}^2 + r_{yt}^2)\,\big].$$
The right-hand member can be put in the form
$$\tfrac12\,\{(r_{xz} - r_{xy}r_{yz})(r_{yt} - r_{yz}r_{zt}) + (r_{xt} - r_{xz}r_{zt})(r_{yz} - r_{xy}r_{xz}) + (r_{xz} - r_{xt}r_{zt})(r_{yt} - r_{xy}r_{xt}) + (r_{xt} - r_{xy}r_{yt})(r_{yz} - r_{yt}r_{zt})\},$$
and if we remember that for normal distributions
$$\sigma_{r_{xy}} = (1 - r_{xy}^2)/\sqrt{N}, \qquad \sigma_{r_{zt}} = (1 - r_{zt}^2)/\sqrt{N},$$
this result agrees with Pearson and Filon’s value quoted above as equation (7).


* This result, which is accurate for normal distributions, is given as approximately true for such
distributions by H. E. Soper, Biometrika, Vol. IX. p. 100.





(4) To find the probable error of a standard deviation.

$$\sigma_x^2 = p_{x^2}, \qquad \frac{d\sigma_x}{\sigma_x} = \frac{dp_{x^2}}{2p_{x^2}} \tag{40}$$
Hence
$$\frac{\sigma_{\sigma_x}^2}{\sigma_x^2} = \frac{p_{x^4} - p_{x^2}^2}{4p_{x^2}^2 N} \ \text{ by (17)} \ = \frac{\beta_2 - 1}{4N} \tag{41}$$

This result is well known, and for normal distributions, i.e. when $\beta_2 = 3$, becomes
$\sigma_{\sigma_x} = \sigma_x/\sqrt{2N}$, agreeing with (1).
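
A minimal Python sketch of (41) follows. The function names are hypothetical, and the factor 0.67449 converting a standard error to a probable error is the usual normal-theory constant, supplied here as an assumption since it is not stated in this passage.

```python
import math

# Ratio of probable error to standard error for a normal distribution (assumed).
PROBABLE_ERROR_FACTOR = 0.67449


def sd_standard_error(sigma, beta2, n):
    """Standard error of a standard deviation, from (41):
    sigma * sqrt((beta2 - 1) / (4 n))."""
    return sigma * math.sqrt((beta2 - 1.0) / (4.0 * n))


def sd_probable_error(sigma, beta2, n):
    """Probable error of a standard deviation."""
    return PROBABLE_ERROR_FACTOR * sd_standard_error(sigma, beta2, n)


# Normal case (beta2 = 3) reduces to sigma / sqrt(2 n).
normal_case = sd_standard_error(2.0, 3.0, 100)
```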


To find the correlation between a standard deviation $\sigma_x$ and a correlation
coefficient $r_{yz}$, we multiply (40) by the equation

$$\frac{dr_{yz}}{r_{yz}} = \frac{dp_{yz}}{p_{yz}} - \frac{dp_{y^2}}{2p_{y^2}} - \frac{dp_{z^2}}{2p_{z^2}},$$
and sum for all samples and divide by their number in the usual way, obtaining
$$\frac{2N\sigma_{\sigma_x}\sigma_{r_{yz}}R_{\sigma_x r_{yz}}}{\sigma_x r_{yz}} = \frac{p_{x^2yz} - p_{x^2}p_{yz}}{p_{x^2}p_{yz}} - \frac12\,\frac{p_{x^2y^2} - p_{x^2}p_{y^2}}{p_{x^2}p_{y^2}} - \frac12\,\frac{p_{x^2z^2} - p_{x^2}p_{z^2}}{p_{x^2}p_{z^2}}$$
$$\qquad = \frac{q_{x^2yz}}{r_{yz}} - \frac{q_{x^2y^2} + q_{x^2z^2}}{2} \tag{42}$$
a result which as before can be approximated to in the case of linear regression, and
which for normal distributions becomes*

$$\frac{2N}{\sigma_x r_{yz}}\cdot\frac{\sigma_x}{\sqrt{2N}}\cdot\frac{1 - r_{yz}^2}{\sqrt{N}}\,R_{\sigma_x r_{yz}} = \frac{r_{yz} + 2r_{xy}r_{xz}}{r_{yz}} - \tfrac12\,(2 + 2r_{xy}^2 + 2r_{xz}^2),$$
i.e.
$$R_{\sigma_x r_{yz}} = \frac{2r_{xy}r_{xz} - (r_{xy}^2 + r_{xz}^2)\,r_{yz}}{\sqrt2\,(1 - r_{yz}^2)} \tag{43}$$

agreeing with equation (5).


* For the case z = x, i.e. $R_{\sigma_x r_{xy}}$, cf. Biometrika, Vol. IX. p. 8.





ON THE CORRELATION BETWEEN THE “CORRECTED” 
CANCER AND DIABETES DEATHRATES. 


By C. A. CLAREMONT, B.Sc., Biometric Laboratory, University College. 


In a paper due to Maynard published in this journal* it is shown that in the 
United States in the case of both states and great towns there is a marked correlation 
between the corrected deathrates from cancer and diabetes. The point is of very 
great interest, and additional investigations by Pearson and others† and by
Greenwood and Frances Wood‡ confirm Maynard’s result for the material used
by him. On the other hand Greenwood and Wood show that it is not apparently 
true for Switzerland. 


Maynard writes§: “If the increased rates observed in the cases of cancer 
and diabetes were due to a common cause then it is probable that their rates of 
growth will be found to be fairly highly correlated. Not being able to obtain 
rates for a sufficient period of time from the United States reports, the rates given 
in the Registrar’s Report for England and Wales, in five-yearly groups for the 
35 years 1871 to 1905, were used and here p = .8060 ± .0893. This high correlation
shows I think a strong probability that there is a common factor influencing the 
increase of both diseases. That this value is explicable on the assumption that 
the increased rates are merely apparent and due to more careful diagnosis is in 
view of the facts already mentioned almost inconceivable.” 


In view of the development of the variate difference correlation method since 
the publication of Maynard’s memoir, it seemed possible to test this point, and 
at the suggestion of Professor Pearson I undertook the necessary calculations. 
It is to be noted that it is partly the continuous rise in the deathrates from these 
two diseases during the last few decades that has suggested a possibly organic 
relation between them. This increase is effectively shown in the Diagrams of the 
corrected deathrates I and II for these diseases, and in the combined “spot” 
diagram for individual years, III, which brings out the contemporary rises. 


* Biometrika, Vol. VII. pp. 276-304.

† Journal of R. Statistical Society, Vol. 73, p. 534.

‡ Journal of Hygiene, Vol. XIV. p. 83, 1914.

§ loc. cit. p. 289.

‖ I have heartily to thank Miss B. C. B. Cave for revision of my arithmetic and the correction of
several errors.
DIAGRAM III. Cancer and Diabetes. Corrected Male Deathrates showing scatter of
pairs of occurrences in the same years. (Axis: Cancer, Corrected Deathrate per Million.)

The diagram indicates an apparent highly correlated change.


It should be noted, however, that Newsholme and King* conclude that the 
increase of cancer is apparent only and due to improvement in diagnosis and 
more careful certification of causes of death. Whether there has been a real 
increase in diabetes is a question that these authors consider to be undecided. 
It is difficult, however, in the case of England and Wales to believe that the 
increase in the corrected deathrates for both cancer and diabetes that has gone
on continuously since 1900 can have anything to do with increased accuracy 


* R. S. Proc. Vol. LIV. p. 228 (1893). An examination of our Diagram I shows that the increase of
the corrected cancer deathrate from 1873 to 1893 was practically the same as from 1893 to 1913. If 
improved diagnosis is the source of the increased cancer rate it is certainly remarkable that the 
improvement should have been so uniform for the space of forty years. 
of diagnosis of either disease in this recent period, and the general sweep of the
plotted curves seems to show that the source of the increase whatever its nature 
has been continuous since the middle of last century. 


In the following investigation I have correlated the deaths from cancer and 
diabetes for the male population only. I had intended to deal also with the female
population, but the results for the male were so conclusive, that I did not think 
it needful to go further. But the idea of dealing with the female deathrates led 
me to select as standard population the total male and female population of 
England and Wales as given by the 1901 census. Of course the choice of the
standard population to obtain a corrected deathrate is arbitrary, but if female
deathrates are to be compared with male, we must take a common standard for
both, and this is why the combined male and female age groups of 1901 were taken
as the standard population classes.


DIAGRAM V. Associated Inflexion Points in Corrective Factors and % Males under 45.
(Axes: Percentage of Males under 45; Corrective Factor for Cancer; Year, 1861 to 1911.)
Diagram illustrating that the inflexion point in the corrective factor diagram (IV) corresponds
to actual changes in the age distribution of the population.


A word must be said as to the manner of correcting the crude deathrate for 
age groups. Actually we only know the age groups accurately for the census 
years; the age groups for intervening years are more or less guesses not beyond 
suspicion. Accordingly it was considered best to obtain the corrective factors 
for the crude deathrates in each census year, then to plot these to the year, draw 
a continuous curve through the plotted points by aid of a spline and read off from 
this curve the corrective factors for non-census years. This was done on a large
diagram of which IV is a small reproduction. 


A noteworthy feature of these corrective factor curves is the inflexion which 
occurs between 1861 and 1871. There are two tests of its reality. First the 
diabetes curve confirms the cancer curve and secondly as far as the data permit 
an approximative factor was obtained for the cancer corrective factor curve for 
the census of 1851; it is seen to substantiate the general sweep of the corrective 
factor curve and confirms the inflexion between 1861 and 1871. Diagram V 
shows that the inflexion also occurs in the age distribution of the population as 
exhibited in percentage of males under 45.


In dealing with the years 1875 to 1880, the few deaths separately recorded 
under Melanosis, Fungus Haematodes and Sweep’s Cancer—now entitled simply 
cancer—were included under the total deaths from cancer. 


The following table showing the work for the year 1911 indicates the manner
in which the census year corrective factors were found. The crude deathrates
per million of the male population in this year were: Diabetes 110.96 and
Cancer 893.57.


TABLE I. Corrective Factors for Census Year 1911.

Age Groups | Cancer Deaths (Males) | Diabetes Deaths (Males) | Census Population (Males), 1911 | Deathrate per million, Cancer = r_1 | Deathrate per million, Diabetes = r_2 | Standard Population, 1901, per million = A | Corrected Deathrate, Cancer = r_1 A/10^6 | Corrected Deathrate, Diabetes = r_2 A/10^6
Under 5  |     64 |   11 | 1,936,113 |   33.1 |   5.7 |   114,262 |   3.78 |   .65
5-10     |     51 |   22 | 1,847,295 |   27.6 |  11.9 |   107,209 |   2.96 |  1.27
10-15    |     39 |   36 | 1,747,631 |   22.3 |  20.6 |   102,735 |   2.29 |  2.12
15-20    |     60 |   67 | 1,654,895 |   36.3 |  40.5 |    99,796 |   3.62 |  4.04
20-25    |     95 |   71 | 1,502,652 |   63.2 |  47.3 |    95,946 |   6.06 |  4.54
25-35    |    317 |  157 | 2,831,655 |  111.9 |  55.4 |   161,580 |  18.08 |  8.95
35-45    |    978 |  183 | 2,336,508 |  418.6 |  78.3 |   122,848 |  51.42 |  9.62
45-55    |   2901 |  252 | 1,694,333 | 1712.2 | 148.7 |    89,222 | 152.77 | 13.27
55-65    |   4627 |  461 | 1,085,156 | 4263.9 | 424.8 |    59,741 | 254.73 | 25.38
65-75    |   4602 |  512 |   602,764 | 7634.8 | 849.4 |    33,080 | 252.56 | 28.10
75-85    |   1687 |  155 |   183,869 | 9175.0 | 843.0 |    12,090 | 110.93 | 10.19
85-      |    168 |    9 |    22,737 | 7388.8 | 395.8 |     1,491 |  11.02 |   .59
All ages | 15,589 | 1936 |         - | 893.57 | 110.96 | 1,000,000 | 870.22 | 108.72

Accordingly for 1911 we have corrective factors for:
Cancer = $_1F_{1911}$ = 870.22/893.57 = .97386.
Diabetes = $_2F_{1911}$ = 108.72/110.96 = .97981.
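
The arithmetic of Table I can be sketched in Python as follows. The age-specific rates and the standard population are transcribed from the table; the 45-55 entry of the standard population is read as 89,222, an assumption made because that value brings the column total to exactly one million. The function name `corrected_rate` is hypothetical.

```python
# Age-specific male deathrates per million, 1911 (Table I), youngest group first.
rates_cancer = [33.1, 27.6, 22.3, 36.3, 63.2, 111.9,
                418.6, 1712.2, 4263.9, 7634.8, 9175.0, 7388.8]
rates_diabetes = [5.7, 11.9, 20.6, 40.5, 47.3, 55.4,
                  78.3, 148.7, 424.8, 849.4, 843.0, 395.8]

# Standard population of 1901 per million (A); 45-55 read as 89,222 (assumption).
standard = [114262, 107209, 102735, 99796, 95946, 161580,
            122848, 89222, 59741, 33080, 12090, 1491]


def corrected_rate(rates, standard_per_million):
    """Directly standardised deathrate per million: sum of r * A / 10^6."""
    return sum(r * a for r, a in zip(rates, standard_per_million)) / 1_000_000


corrected_cancer = corrected_rate(rates_cancer, standard)
corrected_diabetes = corrected_rate(rates_diabetes, standard)

# Corrective factor = corrected rate / crude rate for the same year.
factor_cancer = corrected_cancer / 893.57
factor_diabetes = corrected_diabetes / 110.96
```

Because the sketch sums unrounded products, its results may differ from the printed figures in the last decimal place.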
The accompanying Table II gives under each year the total male population,
the total deaths from cancer and from diabetes, the crude deathrates, the corrective
factors and the corrected deathrates. From the latter the first six differences,
$\Delta_1$ to $\Delta_6$, were found and these were correlated for the two diseases. The following
correlations were found:


Correlation of Differences of Corrected Deathrates.


Corrected Deathrates    + .958 ± .008,
First Differences       + .058 ± .113,
Second Differences      + .043 ± .130,
Third Differences       + .047 ± .143,
Fourth Differences      + .051 ± .154,
Fifth Differences       + .050 ± .164,
Sixth Differences       + .045 ± .173.


It will be seen that with the first differences we get an enormous drop in the
correlation of the cancer and diabetes deathrates, i.e. from + .958 to + .058, and
that very rapidly the correlation becomes steady* and insignificant. Thus with
the removal of the time factor there appears to be no organic relationship between
the prevalence of diabetes and cancer. A year in which there is an excess of diabetes
deaths is not a year which will probably have an excess of cancer deaths.
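
The effect of differencing can be illustrated with a small Python sketch. The two series below are synthetic and purely illustrative (they are not the deathrates of Table II): each is a steady trend plus a fluctuation, the fluctuations being chosen so that they are uncorrelated with one another. The helper names are hypothetical.

```python
import math


def differences(series, order=1):
    """Return the successive differences of the given order."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def pearson(x, y):
    """Product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


years = range(41)
# Trend plus an alternating fluctuation.
x = [t + (1 if t % 2 == 0 else -1) for t in years]
# Steeper trend plus a fluctuation of period four, unrelated to the first.
y = [2 * t + (1, 1, -1, -1)[t % 4] for t in years]

raw_correlation = pearson(x, y)                                  # dominated by the trends
first_difference_correlation = pearson(differences(x), differences(y))
```

The raw series are almost perfectly correlated because both rise with time, while the first differences, from which the linear trend has been removed, show no correlation.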


Now this result although absolutely conclusive as far as it goes, must not be 
stretched beyond its exact limitation. There is nothing to show that an increase 
of cancer at any given epoch will be accompanied by an increase of diabetes.
It might be argued that it is conceivable that an increase of diabetes will be
followed at an interval by an increase of cancer. I have not directly tested this,
but it is to some extent indirectly tested by the method of variate differences.
The sixth difference correlation correlates functions of the deathrates for seven
years in the case of two diseases, and would be likely to indicate if there were any
such related succession. At the same time it must be remarked that previous
investigators have all dealt with the contemporaneous deathrates of cancer and 


* This steadiness can be illustrated also by the method discussed in Biometrika, Vol. X. p. 272 and
illustrated on p. 346.


Values of $\sigma^2_{\Delta_m}/\sigma^2_{\Delta_{m-1}}$, and their approach to $4 - 2/m$.

m | Theoretical Ratio = 4 - 2/m | Cancer | Diabetes | Mean
1 | 2.000 | 0.655 | 0.830 | 0.742
2 | 3.000 | 2.589 | 2.969 | 2.779
3 | 3.333 | 3.431 | 3.386 | 3.409
4 | 3.500 | 3.446 | 3.549 | 3.497
5 | 3.600 | 3.719 | 3.627 | 3.673
6 | 3.667 | 3.653 | 3.669 | 3.661

It will be seen that there is a rapid approach to the theoretical values of the ratio.
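
The theoretical column can be reproduced in a few lines of Python. The sketch assumes the standard result that the m-th difference of uncorrelated terms of equal variance has variance multiplied by the binomial coefficient C(2m, m); the function names are hypothetical.

```python
from math import comb


def variance_multiplier(m):
    """Variance of the m-th difference of uncorrelated unit-variance terms."""
    return comb(2 * m, m)


def theoretical_ratio(m):
    """Ratio of the variance of the m-th difference to that of the (m-1)-th."""
    return variance_multiplier(m) / variance_multiplier(m - 1)


# Ratios for the first six differences; each equals 4 - 2/m.
ratios = {m: theoretical_ratio(m) for m in range(1, 7)}
```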
TABLE II. Cancer and Diabetes corrected Deathrates and Corrective Factors
England and Wales 1860-1913.

Year | Male Population | Cancer Deaths | Cancer Crude Deathrate per Million | Cancer Corrective Factor | Diabetes Deaths | Diabetes Crude Deathrate per Million | Diabetes Corrective Factor
1860 |  9,704,394 |  2100 | 216.4 | 1.0544 |  346 |  35.7 | 1.0542
1861 |  9,801,152 |  2180 | 222.4 | 1.0523 |  363 |  37.0 | 1.0541
1862 |  9,923,272 |  2256 | 227.3 | 1.0498 |  382 |  38.5 | 1.0537
1863 | 10,046,909 |  2311 | 230.0 | 1.0480 |  359 |  35.7 | 1.0536
1864 | 10,172,089 |  2459 | 242.7 | 1.0465 |  454 |  44.6 | 1.0538
1865 | 10,298,826 |  2389 | 232.0 | 1.0452 |  430 |  41.8 | 1.0541
1866 | 10,427,146 |  2532 | 242.8 | 1.0444 |  437 |  41.9 | 1.0545
1867 | 10,557,066 |  2650 | 251.0 | 1.0439 |  434 |  41.1 | 1.0551
1868 | 10,688,600 |  2743 | 256.6 | 1.0440 |  445 |  41.6 | 1.0559
1869 | 10,821,775 |  2933 | 271.0 | 1.0446 |  497 |  45.9 | 1.0569
1870 | 10,956,608 |  2971 | 271.2 | 1.0460 |  470 |  42.9 | 1.0580
1871 | 11,092,620 |  3060 | 275.9 | 1.0478 |  541 |  48.8 | 1.0593
1872 | 11,242,495 |  3228 | 287.1 | 1.0506 |  513 |  45.6 | 1.0605
1873 | 11,394,394 |  3337 | 292.9 | 1.0540 |  550 |  48.3 | 1.0623
1874 | 11,548,346 |  3470 | 300.5 | 1.0575 |  528 |  45.7 | 1.0642
1875 | 11,704,378 |  3614 | 308.8 | 1.0615 |  614 |  52.5 | 1.0662
1876 | 11,862,519 |  3708 | 312.6 | 1.0650 |  582 |  49.1 | 1.0683
1877 | 12,022,796 |  3950 | 328.5 | 1.0686 |  663 |  55.1 | 1.0703
1878 | 12,185,238 |  4164 | 341.7 | 1.0719 |  669 |  54.9 | 1.0722
1879 | 12,349,875 |  4149 | 336.0 | 1.0747 |  645 |  52.2 | 1.0738
1880 | 12,516,737 |  4423 | 353.4 | 1.0770 |  624 |  49.9 | 1.0752
1881 | 12,673,435 |  4611 | 363.8 | 1.0787 |  769 |  60.7 | 1.0762
1882 | 12,808,460 |  4685 | 365.8 | 1.0795 |  733 |  57.2 | 1.0765
1883 | 12,944,923 |  4967 | 383.7 | 1.0800 |  838 |  64.7 | 1.0764
1884 | 13,082,837 |  5346 | 408.6 | 1.0804 |  886 |  67.7 | 1.0759
1885 | 13,222,216 |  5495 | 415.6 | 1.0803 |  896 |  67.8 | 1.0752
1886 | 13,363,079 |  5754 | 430.6 | 1.0798 |  978 |  73.2 | 1.0740
1887 | 13,505,441 |  6262 | 463.7 | 1.0794 | 1019 |  75.5 | 1.0724
1888 | 13,649,314 |  6284 | 460.4 | 1.0787 | 1070 |  78.4 | 1.0714
1889 | 13,794,721 |  6891 | 499.5 | 1.0777 |  980 |  71.0 | 1.0697
1890 | 13,941,671 |  7137 | 511.9 | 1.0766 | 1076 |  77.2 | 1.0679
1891 | 14,092,535 |  7294 | 517.6 | 1.0754 | 1082 |  76.8 | 1.0662
1892 | 14,252,190 |  7547 | 529.5 | 1.0740 | 1142 |  80.1 | 1.0642
1893 | 14,413,657 |  7908 | 548.6 | 1.0728 | 1169 |  81.1 | 1.0622
1894 | 14,576,948 |  8077 | 554.1 | 1.0715 | 1166 |  80.0 | 1.0603
1895 | 14,742,091 |  8628 | 585.3 | 1.0698 | 1262 |  85.6 | 1.0584
1896 | 14,909,104 |  9216 | 618.1 | 1.0682 | 1243 |  83.4 | 1.0560
1897 | 15,078,010 |  9573 | 634.9 | 1.0662 | 1303 |  86.4 | 1.0536
1898 | 15,248,823 |  9932 | 651.3 | 1.0639 | 1437 |  94.2 | 1.0510
1899 | 15,421,578 | 10337 | 670.3 | 1.0614 | 1448 |  93.9 | 1.0480
1900 | 15,596,283 | 10475 | 671.6 | 1.0583 | 1455 |  93.3 | 1.0445
1901 | 15,769,412 | 10891 | 690.6 | 1.0550 | 1565 |  99.2 | 1.0408
1902 | 15,933,658 | 11098 | 696.5 | 1.0505 | 1462 |  91.8 | 1.0368
1903 | 16,099,612 | 11799 | 732.9 | 1.0453 | 1409 |  87.5 | 1.0322
1904 | 16,267,291 | 12086 | 743.0 | 1.0390 | 1612 |  99.1 | 1.0270
1905 | 16,436,707 | 12470 | 758.7 | 1.0318 | 1708 | 103.9 | 1.0214
1906 | 16,607,890 | 13257 | 798.2 | 1.0238 | 1736 | 104.5 | 1.0153
1907 | 16,780,848 | 13199 | 786.6 | 1.0150 | 1738 | 103.6 | 1.0070
1908 | 16,955,609 | 13901 | 819.8 | 1.0056 | 1842 | 108.6 | 1.0022
1909 | 17,132,182 | 14263 | 832.5 | 0.9953 | 1898 | 110.8 | 0.9948
1910 | 17,310,586 | 14843 | 857.5 | 0.9846 | 1982 | 114.5 | 0.9873
1911 | 17,490,847 | 15589 | 893.6 | 0.9739 | 1936 | 111.0 | 0.9798
1912 | 17,672,985 | 16188 | 916.0 | 0.9620 | 2064 | 116.8 | 0.9716
1913 | 17,857,014 | 16918 | 947.4 | 0.9580 | 2122 | 118.8 | 0.9637



























diabetes. Of course such an investigation as the present does not provide any 
measure of whether persons contracting diabetes are more liable to die of cancer, 
for the deaths would be registered as deaths from cancer, yet this not improbably 
is the vital problem*. The result reached, however, is consistent with a continuous 
increase of both cancer and diabetes not essentially related organically with each 
other. In this case Maynard’s results for the United States would mean that 
in that huge population the towns and states were in heterogeneous stages of 
historical development, whether as to medical training, accuracy of record or 
cultural conditions, while the non-appearance of the high correlation in the Swiss 
data would merely signify that the units chosen there were homogeneous in such 
characters, and this considering the size of Switzerland is not improbable. In 
other words some American districts present with regard to cancer and diabetes 
the English conditions of 1870 and others the English conditions of 1910. The 
time correlation of the English deathrates may thus correspond to the geographical 
correlation of the American deathrates. If so, since, when the time factor is 
removed, the English diabetes and cancer deathrates show no organic correlation, 
we cannot use the English material to support the view that the American relation- 
ship is organic in character. The correlations that Maynard has indicated between 
both cancer and diabetes deathrates and the spread of insanity, suicides and the 
newspaper press, seem indeed to indicate not an organic relationship of the two 
diseases, but as Maynard himself has suggested a wider range of both with 
advancing cultural conditions. 


* The enquiry as to whether cancer patients are suffering or have suffered from diabetes should 
always be made and the answer recorded. 














A CONTRIBUTION TO THE PROBLEM OF 
HOMOTYPOSIS 


DATA FROM THE LEGUME CERCIS CANADENSIS 
By J. ARTHUR HARRIS, Ph.D., Carnegie Institution of Washington, U.S.A.


I. INTRODUCTORY REMARKS.


In 1905, I began the collection of series of data for a comprehensive study 
of the problem of intra-individual or “homotypic”* correlation for fertility and 
fecundity characters in plants. Such characters were chosen not merely because 
of particular interest in that subject in general but because of the conviction 
that the intra-class and inter-class† correlation methods applied to the problem
of fertility and fecundity might yield results of considerable physiological interest. 
Since that time, circumstances have prevented my carrying out the work along 
the lines originally laid down. The constants given here are drawn from notes of 
the work then done. 


II. MATERIALS.


The data presented are exclusively those for pod length l, number of ovules
formed o, number of seeds developing s, and number of ovules failing to develop 
f, in the small arborescent legume, Cercis Canadensis. Three collections are 
involved: one from Meramec Highlands, near St Louis, Mo., made in the autumn 
of 1905; one from the vicinity of Lawrence, Kansas, taken in 1905; one from 
the neighbourhood of Sharpsburg, Athens County, Ohio, gathered in 1908. These 
materials have already been considered in dealing with other problems quite
distinct from the present one; the reader seeking more comprehensive information
with regard to the materials used should consult the papers cited below‡.


For the Meramec Highlands collections two sets of symmetrical homotypic 
correlation tables were prepared. 


* K. Pearson, and others: Phil. Trans. B, Vol. CXCVII. pp. 286-288, 1901.
† Biometrika, Vol. IX. pp. 446-472, 1913.
‡ Bot. Gaz. Vol. L. pp. 117-127, 1910; Bull. Torr. Bot. Club, Vol. XLI. pp. 243-256, 1914.
I used the first 26 pods taken from 112 trees. The desirability of working
with a larger series of data from individual trees then appealed to me and a table 
comprising the first 100 pods from the 60 trees from which that number could 
be secured was undertaken. For both of these series all the possible direct and 
cross homotypic correlations were ascertained for the fertility characters. 


The Kansas series comprised only 22 individual trees, from each of which 
100 pods were counted. The Ohio series included 150 pods each from 26 trees. 
For these two series all possible homotypic correlations for the fertility characters 
have been determined. 


III. METHODS.


The three direct intra-individual or homotypic correlations are: 
(1) Ovules of first pod and ovules of second pod, Tables I, II*. 


(2) Seed developing in first pod and seed developing in second pod, Tables 
III, IV. 


(3) Ovules failing in first pod and ovules failing in second pod, Tables V, VI. 
The cross-homotypic relationships are : 

(4) Ovules of first pod and seeds developing in second pod, Tables VII, VIII. 
(5) Ovules of first pod and ovules failing in second pod. 

(6) Seeds developing in first pod and ovules failing in second pod. 

Tables for two of the cross correlations are unnecessary. Thus for the relationship
between the number of ovules in the first pod and the number of ovules failing
in the second pod, $r_{o_1f_2}$, the slope of the regression line is clearly

$$r_{o_1f_2}\frac{\sigma_{f_2}}{\sigma_{o_1}} = r_{o_1o_2}\frac{\sigma_{o_2}}{\sigma_{o_1}} - r_{o_1s_2}\frac{\sigma_{s_2}}{\sigma_{o_1}},$$
or in terms of correlation
$$r_{o_1f_2} = r_{o_1o_2}\frac{\sigma_{o_2}}{\sigma_{f_2}} - r_{o_1s_2}\frac{\sigma_{s_2}}{\sigma_{f_2}} \tag{i}$$

Again, for number of seeds in the first pod and number of ovules failing to develop
in the second pod the slope of the regression straight line is

$$r_{s_1f_2}\frac{\sigma_{f_2}}{\sigma_{s_1}} = r_{s_1o_2}\frac{\sigma_{o_2}}{\sigma_{s_1}} - r_{s_1s_2}\frac{\sigma_{s_2}}{\sigma_{s_1}},$$
whence
$$r_{s_1f_2} = r_{s_1o_2}\frac{\sigma_{o_2}}{\sigma_{f_2}} - r_{s_1s_2}\frac{\sigma_{s_2}}{\sigma_{f_2}} \tag{ii}$$
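
Relation (i) follows from f = o - s and may be verified on any set of paired pods. In the Python sketch below the pod counts are invented for illustration only, and the helper names are hypothetical; the correlation of first-pod ovules with second-pod failures is computed directly and again from (i).

```python
import math


def mean(v):
    return sum(v) / len(v)


def sd(v):
    """Standard deviation about the mean (divisor n)."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))


def corr(u, v):
    """Product-moment correlation."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov / (sd(u) * sd(v))


# Invented pairs of pods: ovules of the first pod, ovules and seeds of the second.
o1 = [4, 5, 6, 5, 4, 6, 5, 7]
o2 = [5, 4, 6, 6, 5, 5, 7, 6]
s2 = [3, 4, 5, 3, 4, 2, 6, 4]
f2 = [o - s for o, s in zip(o2, s2)]  # ovules failing in the second pod

direct = corr(o1, f2)
# Formula (i): r_o1f2 = r_o1o2 * sd(o2)/sd(f2) - r_o1s2 * sd(s2)/sd(f2)
via_formula = (corr(o1, o2) * sd(o2) - corr(o1, s2) * sd(s2)) / sd(f2)
```

The two values agree to rounding error, so a separate table for this cross correlation is indeed unnecessary once the other constants are known.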


* The numbers refer to the tables for the Meramec Highlands series.


In calculating the correlations for length of pod in the Meramec Highlands
series and for the fertility characters of the Ohio and Kansas series I have had
recourse to the direct and cross intra-class correlation formulae (v)-(xii) of
Biometrika, Vol. IX. pp. 450-452, 1913. In present notation the product summations
are for the several correlations:

For $r_{o_1o_2}$,  $\{S[\Sigma(o')]^2 - S[\Sigma(o'^2)]\}/N$  (iii)
For $r_{s_1s_2}$,  $\{S[\Sigma(s')]^2 - S[\Sigma(s'^2)]\}/N$  (iv)
For $r_{f_1f_2}$,  $\{S[\Sigma(f')]^2 - S[\Sigma(f'^2)]\}/N$  (v)
For $r_{o_1s_2}$,  $\{S[\Sigma(o')\,\Sigma(s')] - S[\Sigma(o's')]\}/N$  (vi)
For $r_{o_1f_2}$,  $\{S[\Sigma(o')\,\Sigma(f')] - S[\Sigma(o'f')]\}/N$  (vii)
For $r_{s_1f_2}$,  $\{S[\Sigma(s')\,\Sigma(f')] - S[\Sigma(s'f')]\}/N$  (viii)

where N is the population of individuals resulting from the n(n - 1)-fold
weighting, $\Sigma$ denotes a summation for the individual pods of a class and S a
summation for the classes (individual trees). For data see Table IX.


In the first three of these the negative sign term of the formula is merely the
second moment of the unweighted population and in the fourth it is the product
moment of the correlation surface $r_{os}$ (the “organic” relationship between
ovules and seeds of the same pod) from which the three second moments, $\Sigma(o'^2)$,
$\Sigma(s'^2)$, $\Sigma(f'^2)$, may be calculated. The 5th and 6th cross correlations may be
determined from the other correlations by (i) and (ii), or since f = o - s the
minus terms in (vii) and (viii) are given by

$$S[\Sigma(o'f')] = S[\Sigma(o'^2)] - S[\Sigma(o's')],$$
$$S[\Sigma(s'f')] = S[\Sigma(s'o')] - S[\Sigma(s'^2)],$$

which may be most easily calculated from the published tables for $r_{os}$, since all
classes are equally large.
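
The saving effected by these product summations can be seen in a short Python sketch, which computes a direct intra-class correlation once from class sums, in the manner of (iii), and once by forming every ordered pair of pods explicitly. The data and the function names are invented for illustration; the sketch is one reading of the formulae cited, not a transcription of them.

```python
from itertools import permutations


def intraclass_by_sums(classes):
    """Intra-class correlation from class sums; N is the number of ordered pairs."""
    n_pairs = sum(len(c) * (len(c) - 1) for c in classes)
    # Product sum over ordered pairs within a class: (sum x)^2 - sum x^2.
    products = sum(sum(c) ** 2 - sum(x * x for x in c) for c in classes)
    # Each pod enters (n - 1) times as first member of a pair.
    first = sum((len(c) - 1) * sum(c) for c in classes)
    second = sum((len(c) - 1) * sum(x * x for x in c) for c in classes)
    m = first / n_pairs
    variance = second / n_pairs - m * m
    return (products / n_pairs - m * m) / variance


def intraclass_by_pairs(classes):
    """The same constant from the full symmetrical table of ordered pairs."""
    pairs = [(a, b) for c in classes for a, b in permutations(c, 2)]
    n_pairs = len(pairs)
    m = sum(a for a, _ in pairs) / n_pairs
    variance = sum(a * a for a, _ in pairs) / n_pairs - m * m
    return (sum(a * b for a, b in pairs) / n_pairs - m * m) / variance


# Invented ovule counts for four pods from each of three trees.
trees = [[4, 5, 5, 6], [3, 4, 4, 3], [6, 5, 7, 6]]
r_sums = intraclass_by_sums(trees)
r_pairs = intraclass_by_pairs(trees)
```

With 100 pods per tree the explicit table would hold 9900 entries for each tree, whereas the summation form needs only the sum and the sum of squares for each tree.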


IV. PRESENTATION OF DATA.


The constants for both direct and cross homotypic correlations are shown for 
all series in Table A. 


The reader, in examining these constants, will remember that the two Meramec
Highlands series were taken from the same habitat and in the same year. They
are known to differ only in (i) the smaller number of pods from each individual
in the first series, (ii) the fact that the first series contains pods from 50 individuals
which do not occur in the second lot. The Ohio and Kansas collections on the
other hand represent materials of the same species from localities several hundreds
of miles distant. Unfortunately, both of these contain too few individuals to be
fully trustworthy.


There can be no reasonable question of the statistical trustworthiness of all
the direct homotypic correlations. The lowest is that for ovules failing per
pod and this is in all cases six or more times its probable error. The coefficients
$r_{o_1o_2}$, $r_{s_1s_2}$ are from 25 to 45 times as large as their probable errors. For the most
part the cross correlations may also be regarded as statistically trustworthy.
TABLE A.
Homotypic Correlations* for Cercis.

Combination | Meramec Highlands, Mo., 112 Trees | Meramec Highlands, Mo., 60 Trees | Sharpsburg, Ohio, 26 Trees | Lawrence, Kansas, 22 Trees
Ovules of First Pod and Ovules of Second Pod | .3524 ± .0109 | .3527 ± .0076 | .2768 ± .0100 | .3999 ± .0121
Seeds of First Pod and Seeds of Second Pod | .2677 ± .0116 | .2306 ± .0082 | .1798 ± .0104 | .2019 ± .0138
Ovules Failing in First Pod and Ovules Failing in Second Pod | .0979 ± .0123 | .0684 ± .0087 | .1756 ± .0104 | .0858 ± .0143
Ovules of First Pod and Seeds of Second Pod | .2758 ± .0115 | .2622 ± .0081 | .1173 ± .0106 | .2574 ± .0134
Ovules of First Pod and Ovules Failing in Second Pod | .0178 ± .0125 | .0301 ± .0087 | .0762 ± .0107 | .1195 ± .0142
Seeds of First Pod and Ovules Failing in Second Pod | − .0570 ± .0125 | − .0261 ± .0087 | − .1095 ± .0107 | .0344 ± .0144

* All probable errors are calculated on the basis of the actual, not the weighted, number of pods.


TABLE B.
Regression Equations for Meramec Highlands Collection.

2912 Pods | 6000 Pods
o_2 = 3.0571 + .3524 o_1 | o_2 = 2.9925 + .3527 o_1
s_2 = 2.9563 + .2677 s_1 | s_2 = 3.0018 + .2306 s_1
f_2 = .6171 + .0979 f_1 | f_2 = .6722 + .0684 f_1
s_2 = 2.4485 + .3365 o_1 | s_2 = 2.3982 + .3252 o_1
f_2 = .6089 + .0159 o_1 | f_2 = .5943 + .0275 o_1
f_2 = .8671 − .0453 s_1 | f_2 = .7965 − .0192 s_1


The regression equations for the two Meramec Highlands series appear in
Table B. The regression lines for the direct relationships $r_{o_1o_2}$, $r_{s_1s_2}$ are given
in Diagrams I and II, and for the cross relationships $r_{o_1s_2}$, $r_{s_1o_2}$ in Diagrams III
and IV.


DIAGRAM I. Direct Homotyposis. Cercis Canadensis (Meramec Highlands, 1st Series).
Ovules on Ovules and Seeds on Seeds.

DIAGRAM II. Direct Homotyposis. Cercis Canadensis (Meramec Highlands, 2nd Series).
Ovules on Ovules and Seeds on Seeds.

DIAGRAM III. Cross Homotyposis. Cercis Canadensis (Meramec Highlands, 1st Series).
Ovules on Seeds and Seeds on Ovules.

DIAGRAM IV. Cross Homotyposis. Cercis Canadensis (Meramec Highlands, 2nd Series).
Ovules on Seeds and Seeds on Ovules.

(Axes in the diagrams: Ovules of First Pod; Seeds developing in First Pod; Ovules of Second Pod;
Seeds developing in Second Pod.)


Inspection indicates that the regression of the ovules of the second pod
(Diagrams I and II, upper lines) on the ovules of the first and of the seeds of the
second pod (Diagrams III and IV, steeper lines) on the ovules of the first are
linear. It is not so clear in the case of ovules or seeds of the second pod on seeds
of the first, although even here (see Diagrams I and II, lower lines, III and IV,
less steep lines) the deviation from linearity is not great*.
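
The equations of Table B can be checked for internal consistency with a little Python. The sketch assumes, as holds for a symmetrical table, that first and second pods have the same mean, so that each direct regression line passes through the point (mean, mean); the function name `implied_mean` is hypothetical, and the constants are those printed for the 2912-pod series.

```python
def implied_mean(intercept, slope):
    """Mean m satisfying m = intercept + slope * m."""
    return intercept / (1.0 - slope)


# Direct regressions, 2912-pod series (Table B).
mean_ovules = implied_mean(3.0571, 0.3524)
mean_seeds = implied_mean(2.9563, 0.2677)
mean_failing = implied_mean(0.6171, 0.0979)

# Cross regression of seeds of the second pod on ovules of the first,
# evaluated at the mean number of ovules.
seeds_at_mean_ovules = 2.4485 + 0.3365 * mean_ovules
```

Since f = o - s, the implied mean number of ovules failing should equal the difference of the other two means, and the cross regression evaluated at the mean number of ovules should return the mean number of seeds; both hold to within the rounding of the printed constants.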

Returning now to the constants and considering them in detail, we note that for
all localities the direct homotypic correlations are highest for the number of ovules
per pod and lowest for the number of ovules failing†. The averages‡ are


Ovules and Ovules                     .3430
Seeds and Seeds                       .2164
Ovules failing and Ovules failing     .1198


The cross relationships are highest for ovules formed and seeds developing per
pod; much lower for ovules formed and ovules failing to develop; lowest for
seeds developing and ovules failing to develop: three of the four values for this
last correlation are negative in sign.


The averages are:

    Ovules and Seeds ...                     + .2168
    Ovules and Ovules failing ...            + .0712
    Seeds and Ovules failing ...             − .0440


From these data it appears that the individuality of the trees is more marked 
for number of ovules per pod than for either number of seeds matured or number 
of ovules failing per pod. 


Number of ovules per pod may be regarded in comparison with number of 
seeds developing and number of ovules failing to develop as an independent 
variable§. 

The number of seeds per pod is absolutely limited by (i) the number of ovules 
formed in the pod, and (ii) the number of these which are prevented from developing 
by peculiarities inherent in ovum or sperm, by accidents of fertilization, inadequacy 
of food supply, or other unknown causes. Or conversely, the number of ovules 
failing per pod is dependent upon the number of ovules formed and the number 
of seeds developing. Thus the number of ovules per pod determines within certain 
limits the number of seeds developing, or the number of ovules failing, per pod; 
but in the nature of things it cannot be influenced by either of these. 


Now since the three characters under consideration are correlated, i.e. since
$r_{os}$, $r_{of}$, $r_{sf}$ have been found to have sensible values, it is clear that some homotypic
relationship would arise for seeds because of the correlation for ovules, or for
ovules failing because of the correlation for ovules and seeds. Thus a statistically
significant value of an intra-individual correlation for seeds per pod does not
necessarily mean that there are any specific biological factors influencing the
number of seeds developing or the number of ovules failing in all of the pods of
the same individual in a similar manner. The observed values of correlations
for (semi)dependent characters may be merely the resultant of independent
variables with which they are correlated.

* In the case of the relationship for ovules and ovules in the 60 trees from Meramec Highlands
I have found the raw value of η to be .35336, exceeding r by only .00064.

† In the Ohio series $r_{o_1 o_2}$ is very low; $r_{o_1 o_2}$ and $r_{s_1 s_2}$ are not sensibly different.

‡ The constants for the 112 trees from Meramec Highlands are alone used in obtaining the general
averages.

§ This does not mean that in any given pod all three of these characters are not dependent upon
the same ultimate causes, but merely that in a proximate sense number of ovules per pod is
physiologically independent while number of seeds and number of ovules failing are physiologically dependent.

To remove the influence of $r_{o_1 o_2}$ upon $r_{s_1 s_2}$, we have recourse to the partial
correlation coefficient for two variables, i.e. $o_1$ and $o_2$, constant. In our notation
this is

$$
{}_{o_1 o_2}r_{s_1 s_2} = \frac{r_{s_1 s_2}\,(1 - r_{o_1 o_2}^2) - r_{o_1 s_1} r_{o_1 s_2} - r_{o_2 s_1} r_{o_2 s_2} + r_{o_1 o_2}\,(r_{o_1 s_1} r_{o_2 s_2} + r_{o_1 s_2} r_{o_2 s_1})}{\sqrt{1 - r_{o_1 o_2}^2 - r_{o_1 s_1}^2 - r_{o_2 s_1}^2 + 2 r_{o_1 o_2} r_{o_1 s_1} r_{o_2 s_1}}\;\sqrt{1 - r_{o_1 o_2}^2 - r_{o_1 s_2}^2 - r_{o_2 s_2}^2 + 2 r_{o_1 o_2} r_{o_1 s_2} r_{o_2 s_2}}}
$$

But since we are dealing with symmetrical tables we may write*

$$
{}_{o'o'}r_{s's'} = \frac{r_{s's'}\,(1 - r_{o'o'}^2) - 2 r_{os} r_{o's'} + r_{o'o'}\,(r_{os}^2 + r_{o's'}^2)}{1 - r_{o'o'}^2 - r_{os}^2 - r_{o's'}^2 + 2 r_{o'o'} r_{os} r_{o's'}}
$$

where the correlations without dashes are those for the same pod and those with
dashes are for the relationships between different pods.
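The two forms of the partial correlation coefficient can be compared numerically. The sketch below is only an illustration: the function names and the input values are hypothetical (they are not the constants of any series in this paper), and the general formula is written for variables 1, 2 with variables 3, 4 held constant, taking 1 and 2 as the seeds of the two pods and 3 and 4 as their ovules.

```python
import math


def partial_r_general(r12, r13, r14, r23, r24, r34):
    """Correlation of variables 1 and 2 with variables 3 and 4 held constant."""
    num = (r12 * (1 - r34 ** 2) - r13 * r23 - r14 * r24
           + r34 * (r13 * r24 + r14 * r23))
    d1 = 1 - r34 ** 2 - r13 ** 2 - r14 ** 2 + 2 * r13 * r14 * r34
    d2 = 1 - r34 ** 2 - r23 ** 2 - r24 ** 2 + 2 * r23 * r24 * r34
    return num / math.sqrt(d1 * d2)


def partial_r_symmetric(r_ss, r_oo, r_os, r_cross):
    """Symmetrical-table form: r_os is the same-pod ovule/seed correlation,
    r_cross the ovule/seed correlation between different pods."""
    num = (r_ss * (1 - r_oo ** 2) - 2 * r_os * r_cross
           + r_oo * (r_os ** 2 + r_cross ** 2))
    den = 1 - r_oo ** 2 - r_os ** 2 - r_cross ** 2 + 2 * r_oo * r_os * r_cross
    return num / den


# Hypothetical inputs, chosen only to exercise the two functions.
r_ss, r_oo, r_os, r_cross = 0.25, 0.35, 0.65, 0.20

general = partial_r_general(r12=r_ss, r13=r_os, r14=r_cross,
                            r23=r_cross, r24=r_os, r34=r_oo)
symmetric = partial_r_symmetric(r_ss, r_oo, r_os, r_cross)
print(general, symmetric)
```

Under the symmetry assumptions the two square-root factors of the general denominator coincide, which is why the symmetrical form has no radical.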

In applying this formula to the actual data, the values of $r_{os}$ for the individual
pods are essential. These have been deduced from Tables X, XI below and
Tables VI, VII of a former paper†.

    Meramec Highlands,
        6000 pods,       $r_{os}$ = .6855 ± .0046,
        2912 pods,       $r_{os}$ = .6936 ± .0065,
    Sharpsburg, Ohio,    $r_{os}$ = .4553 ± .0086,
    Lawrence, Kansas,    $r_{os}$ = .6032 ± .0091.


The partial correlation coefficients for seeds per pod for constant numbers of
ovules per pod are:

    Series                 | ${}_{o'o'}r_{s's'}$ | $r_{s's'}$ | ${}_{o'o'}r_{s's'} / r_{s's'}$
    Meramec Highlands      |                     |            |
        112 Trees          |       .1063         |   .2677    |   .397
        60 Trees           |       .0700         |   .2306    |   .304
    Sharpsburg, Ohio       |       .1645         |   .1798    |   .915
    Lawrence, Kansas       |       .0582         |   .2019    |   .288








These give a mean value of .0998 as compared with the mean relationship
uncorrected for the influence of ovules per pod, $r_{s_1 s_2}$ = .2164.
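The ratio column and the mean partial coefficient can be rechecked from the tabulated values. The following is a small arithmetic sketch only; the dictionary keys are labels chosen here for convenience.

```python
# Partial and uncorrected homotypic correlations for seeds, as tabulated above.
partial = {
    "Meramec 112 trees": 0.1063,
    "Meramec 60 trees": 0.0700,
    "Sharpsburg, Ohio": 0.1645,
    "Lawrence, Kansas": 0.0582,
}
uncorrected = {
    "Meramec 112 trees": 0.2677,
    "Meramec 60 trees": 0.2306,
    "Sharpsburg, Ohio": 0.1798,
    "Lawrence, Kansas": 0.2019,
}

# Ratio of the partial coefficient to the uncorrected one, series by series.
ratios = {name: partial[name] / uncorrected[name] for name in partial}

# Mean of the four partial coefficients.
mean_partial = sum(partial.values()) / len(partial)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
print(f"mean partial coefficient: {mean_partial:.4f}")
```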


* Biometrika, Vol. VII. p. 328. For original see K. Pearson, Phil. Trans. A, Vol. CC. p. 31, 1902.

† Bull. Torr. Bot. Club, Vol. XLI. pp. 243-256, 1914.

‡ Under the term physiological as used here are included (a) the ecological factors which determine
whether an ovule shall receive a sperm, (b) the nutritional factors which determine the availability of
food materials, and other environmental prerequisites for development, and (c) the innate vigour or
other physiological characters of the individual which determine whether a fertilized ovule shall develop
into a seed.

Apparently, there are specific physiological‡ factors which tend to differentiate
the trees with respect to seed production, but the bulk of the interdependence
for seeds seems to be merely the resultant of $r_{o_1 o_2}$ and $r_{os}$*.

Relative net individuality of the trees with respect to capacity for seed production
is highest in the Ohio series. This is due to the fact that both $r_{o_1 o_2}$ and
$r_{os}$ are lower for this series than for any other.

The frequency distribution of length of pod measured to the nearest millimetre
in 50 pods each from 60 trees is shown in Table XII. The summed length for
the individual trees, i.e. mean length × 50, is given in Table XIII. By the direct
intra-class formulae I find

    $r_{l_1 l_2}$ = .4784 ± .0095,

a homotypic value distinctly higher than those found for the fertility characters.


V. Discussion of Results.


The only homotypic constants available for fertility characters in plants are
those furnished for several species of Leguminosae by Pearson and his associates†.
They give the direct correlations as shown in Table C.

TABLE C. 
Homotypic Correlations for Fertility Characters. 

















| 
Species of Legume Ovules — | Ripe Seeds | 
a= wise 2 eae Tage be | 
| | 
| Cytisus Scoparius — — *4155 
| Lotus Corniculatus — —H +2354 
Pa es — — 1884 
| Lathyrus odoratus 2182 -2679 -0830 
= es +3658 | -1759 -2091 
Lathyrus Sylvestris 1695 | -1376 2184 
” ” a ' — *1877 
| Vicia Faba” =... | -1724 «| «= 1493 |S «1243 
| ye! a ee et | } 
| 
| Vicia Hirsuta  ... 2315 +1827 | 2077 





In addition to these values I have found in a short series of only 12 trees 
of the arborescent legume Robinia Pseud-acacia worked out as an illustration of 
method{ the values 7,,,. = -452, 1,,,, = °449, 7,,. = °383. For a short series 
(23 plants) of Cytisus Scoparius§ I have found r,,,, = -198. 


The average homotypic values for ovules in Cercis are distinctly higher than
the comparable values for other species hitherto adequately investigated. The
mean values for seeds are about the same as those found by English investigators.
The mean correlation for ovules failing is lower.

* In Sanguinaria (Biometrika, Vol. VII. p. 328) the correlation for the number of seeds on the two
placentae of the same fruit seems to be chiefly due to physiological factors. In this case an organic
correlation is superimposed upon a homotypic. The point may profitably be discussed comparatively
when other data now in hand are completely analyzed.

† K. Pearson and others, Phil. Trans. A, Vol. CXCVII. pp. 364-379, 1901.

‡ Biometrika, Vol. IX. pp. 456-458, 463, 1913.

§ Amer. Nat. Vol. XLV. pp. 566-571, 1911. Possibly this material is of closely selected ancestry.

The homotypic correlation for seeds matured per pod is statistically largely
a resultant of the homotypic correlation for ovules and the “organic” correlation
for ovules and seeds of the same pod. Thus homotypic correlation for seeds per
pod does not entirely disappear when correction for $r_{o_1 o_2}$ and $r_{os}$ is made by the
application of partial correlation formulae. There are, therefore, (proximately)
independent ecological and physiological factors tending to differentiate individuals
with respect to capacity for seed production.

The general rule for fertility characters in Leguminosae seems to be for the
maximum value of the homotypic correlation to be that for number of ovules
per pod; much lower values are found for seeds matured or abortive ovules per
pod. I believe that the evidence also points in the direction of a lower correlation
for ovules than for the more truly vegetative characters of the plant. Thus
Pearson’s average values for $r_{o_1 o_2}$ are far lower than his average for leaves and
other similar organs. The result given above for length of pod, $r_{l_1 l_2}$, is practically
.500, and distinctly higher than $r_{o_1 o_2}$ = .352 from the same habitat.



























































TABLE I.
Ovules per Pod.

    Ovules per Pod |   2 |    3 |     4 |     5 |     6 |    7 | Totals
                 2 |  18 |  111 |   197 |    46 |     3 |    0 |    375
                 3 | 111 |  836 |  2382 |   942 |   239 |   15 |   4525
                 4 | 197 | 2382 | 11770 |  8642 |  2599 |  160 |  25750
                 5 |  46 |  942 |  8642 | 12134 |  5613 |  598 |  27975
                 6 |   3 |  239 |  2599 |  5613 |  3742 |  529 |  12725
                 7 |   0 |   15 |   160 |   598 |   529 |  148 |   1450
            Totals | 375 | 4525 | 25750 | 27975 | 12725 | 1450 |  72800
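As an illustration of how a homotypic correlation follows from a symmetrical table of this kind, the sketch below computes the ordinary product-moment correlation from the cell frequencies of Table I. The function name is hypothetical and the direct product-moment computation is an assumption about method; the paper's own values may have been obtained through intra-class formulae and may include corrections not applied here.

```python
from math import sqrt

# Table I: ovules per pod in one pod (rows) against another pod of the same tree (columns).
values = [2, 3, 4, 5, 6, 7]
counts = [
    [18, 111, 197, 46, 3, 0],
    [111, 836, 2382, 942, 239, 15],
    [197, 2382, 11770, 8642, 2599, 160],
    [46, 942, 8642, 12134, 5613, 598],
    [3, 239, 2599, 5613, 3742, 529],
    [0, 15, 160, 598, 529, 148],
]


def table_correlation(row_values, col_values, table):
    """Product-moment correlation of a two-way frequency table."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    mean_x = sum(v * t for v, t in zip(row_values, row_totals)) / total
    mean_y = sum(v * t for v, t in zip(col_values, col_totals)) / total
    var_x = sum(t * (v - mean_x) ** 2 for v, t in zip(row_values, row_totals)) / total
    var_y = sum(t * (v - mean_y) ** 2 for v, t in zip(col_values, col_totals)) / total
    cov = sum(
        table[i][j] * (row_values[i] - mean_x) * (col_values[j] - mean_y)
        for i in range(len(row_values))
        for j in range(len(col_values))
    ) / total
    return cov / sqrt(var_x * var_y)


grand_total = sum(sum(row) for row in counts)
r_ovules = table_correlation(values, values, counts)
print(grand_total, round(r_ovules, 4))
```

Because the table is symmetrical, the row and column means and variances coincide, so the same function serves for the cross tables (Tables VII and VIII) by passing different row and column values.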
TABLE II.
Ovules per Pod.

    Ovules per Pod |    2 |     3 |      4 |      5 |     6 |    7 |   8 | Totals
                 2 |   98 |  1025 |   1775 |    419 |    49 |    0 |   0 |   3366
                 3 | 1025 | 10608 |  25184 |  11257 |  2300 |  111 |   5 |  50490
                 4 | 1775 | 25184 | 101272 |  74059 | 18242 | 1296 |  31 | 221859
                 5 |  419 | 11257 |  74059 |  92240 | 37849 | 4012 | 142 | 219978
                 6 |   49 |  2300 |  18242 |  37849 | 25458 | 3833 | 181 |  87912
                 7 |    0 |   111 |   1296 |   4012 |  3833 |  712 |  35 |   9999
                 8 |    0 |     5 |     31 |    142 |   181 |   35 |   2 |    396
            Totals | 3366 | 50490 | 221859 | 219978 | 87912 | 9999 | 396 | 594000

TABLE III.
Seeds per Pod.

    Seeds per Pod |   1 |    2 |     3 |     4 |     5 |    6 |   7 | Totals
                1 |  16 |   95 |   179 |   207 |    94 |    9 |   0 |    600
                2 |  95 |  710 |  1596 |  1691 |   693 |  152 |  13 |   4950
                3 | 179 | 1596 |  5106 |  6260 |  2975 |  747 |  62 |  16925
                4 | 207 | 1691 |  6260 | 10164 |  6106 | 1821 | 151 |  26400
                5 |  94 |  693 |  2975 |  6106 |  4986 | 2109 | 212 |  17175
                6 |   9 |  152 |   747 |  1821 |  2109 | 1096 | 166 |   6100
                7 |   0 |   13 |    62 |   151 |   212 |  166 |  46 |    650
           Totals | 600 | 4950 | 16925 | 26400 | 17175 | 6100 | 650 |  72800
TABLE IV.
Seeds per Pod.

    Seeds per Pod |    1 |     2 |      3 |      4 |      5 |     6 |    7 |   8 | Totals
                1 |  394 |  1467 |   2965 |   2861 |   1412 |   283 |   23 |   0 |   9405
                2 | 1467 |  7224 |  17224 |  16294 |   7546 |  1520 |  104 |   2 |  51381
                3 | 2965 | 17224 |  47594 |  52375 |  27161 |  6032 |  388 |   8 | 153747
                4 | 2861 | 16294 |  52375 |  71576 |  45505 | 12278 | 1041 |  30 | 201960
                5 | 1412 |  7546 |  27161 |  45505 |  36836 | 13121 | 1304 |  72 | 132957
                6 |  283 |  1520 |   6032 |  12278 |  13121 |  6334 |  853 |  70 |  40491
                7 |   23 |   104 |    388 |   1041 |   1304 |   853 |  134 |  14 |   3861
                8 |    0 |     2 |      8 |     30 |     72 |    70 |   14 |   2 |    198
           Totals | 9405 | 51381 | 153747 | 201960 | 132957 | 40491 | 3861 | 198 | 594000
TABLE V.
Ovules failing per Pod.

    Ovules failing per Pod |     0 |     1 |    2 |    3 |   4 |  5 | Totals
                         0 | 19140 | 12531 | 3620 |  756 |  91 | 12 |  36150
                         1 | 12531 |  9536 | 3182 |  692 |  97 | 12 |  26050
                         2 |  3620 |  3182 | 1206 |  313 |  53 |  1 |   8375
                         3 |   756 |   692 |  313 |  134 |  30 |  0 |   1925
                         4 |    91 |    97 |   53 |   30 |   4 |  0 |    275
                         5 |    12 |    12 |    1 |    0 |   0 |  0 |     25
                    Totals | 36150 | 26050 | 8375 | 1925 | 275 | 25 |  72800

TABLE VI.
Ovules Failing per Pod.

    Ovules failing per Pod |      0 |      1 |     2 |     3 |    4 |   5 | Totals
                         0 | 143232 | 101090 | 32493 |  6712 | 1408 | 185 | 285120
                         1 | 101090 |  78862 | 27035 |  6332 | 1227 | 185 | 214731
                         2 |  32493 |  27035 | 10220 |  2735 |  596 |  82 |  73161
                         3 |   6712 |   6332 |  2735 |   884 |  233 |  33 |  16929
                         4 |   1408 |   1227 |   596 |   233 |   92 |   8 |   3564
                         5 |    185 |    185 |    82 |    33 |    8 |   2 |    495
                    Totals | 285120 | 214731 | 73161 | 16929 | 3564 | 495 | 594000
TABLE VII.
Seeds per Pod.

    Ovules per Pod |   1 |    2 |     3 |     4 |     5 |    6 |   7 | Totals
                 2 |   7 |   86 |   154 |   110 |    18 |    0 |   0 |    375
                 3 |  57 |  606 |  1726 |  1568 |   471 |   92 |   5 |   4525
                 4 | 261 | 2213 |  7443 | 10099 |  4660 | 1003 |  71 |  25750
                 5 | 206 | 1553 |  5511 | 10150 |  7592 | 2697 | 266 |  27975
                 6 |  67 |  460 |  1931 |  4104 |  3953 | 1975 | 235 |  12725
                 7 |   2 |   32 |   160 |   369 |   481 |  333 |  73 |   1450
            Totals | 600 | 4950 | 16925 | 26400 | 17175 | 6100 | 650 |  72800
TABLE VIII.
Seeds per Pod.

    Ovules per Pod |    1 |     2 |      3 |      4 |      5 |     6 |    7 |   8 | Totals
                 2 |   98 |   658 |   1407 |    953 |    234 |    16 |    0 |   0 |   3366
                 3 | 1158 |  7868 |  19077 |  15761 |   5710 |   882 |   34 |   0 |  50490
                 4 | 4206 | 23467 |  67408 |  78277 |  40285 |  7727 |  485 |   4 | 221859
                 5 | 3066 | 15160 |  49891 |  76785 |  56232 | 17265 | 1525 |  54 | 219978
                 6 |  800 |  3923 |  14687 |  27312 |  27097 | 12491 | 1492 | 110 |  87912
                 7 |   72 |   294 |   1238 |   2793 |   3260 |  2005 |  309 |  28 |   9999
                 8 |    5 |    11 |     39 |     79 |    139 |   105 |   16 |   2 |    396
            Totals | 9405 | 51381 | 153747 | 201960 | 132957 | 40491 | 3861 | 198 | 594000

TABLE IX.

         |      Kansas Series      |       Ohio Series
    Tree | Total Ovules | Total Seeds | Total Ovules | Total Seeds
         |    Σ(o')     |    Σ(s')    |    Σ(o')     |    Σ(s')
       1 |     522      |     461     |     740      |     592
       2 |     471      |     377     |     690      |     578
       3 |     420      |     362     |     861      |     731
       4 |     473      |     412     |     791      |     605
       5 |     400      |     351     |     813      |     576
       6 |     480      |     413     |     831      |     504
       7 |     456      |     368     |     826      |     477
       8 |     591      |     469     |     809      |     519
       9 |     454      |     346     |     753      |     329
      10 |     574      |     491     |     843      |     546
      11 |     429      |     380     |     815      |     490
      12 |     487      |     382     |     879      |     633
      13 |     482      |     407     |     762      |     606
      14 |     552      |     465     |     787      |     453
      15 |     502      |     418     |     769      |     566
      16 |     473      |     411     |     863      |     632
      17 |     396      |     338     |     746      |     641
      18 |     564      |     408     |    1039      |     731
      19 |     614      |     521     |     707      |     510
      20 |     495      |     437     |     962      |     731
      21 |     536      |     435     |     886      |     696
      22 |     444      |     403     |     788      |     649
      23 |      -       |      -      |     806      |     692
      24 |      -       |      -      |     800      |     632
      25 |      -       |      -      |     925      |     651
      26 |      -       |      -      |     932      |     612
TABLE X.
Seeds per Pod.

    Ovules per Pod |  1 |   2 |   3 |    4 |   5 |   6 |  7 | Totals
                 2 |  1 |  14 |   - |    - |   - |   - |  - |     15
                 3 | 11 |  50 | 120 |    - |   - |   - |  - |    181
                 4 |  8 |  89 | 380 |  553 |   - |   - |  - |   1030
                 5 |  3 |  38 | 147 |  421 | 510 |   - |  - |   1119
                 6 |  1 |   7 |  29 |   80 | 169 | 223 |  - |    509
                 7 |  - |   - |   1 |    2 |   8 |  21 | 26 |     58
            Totals | 24 | 198 | 677 | 1056 | 687 | 244 | 26 |   2912

TABLE XI.
Seeds per Pod.

    Ovules per Pod |  1 |   2 |    3 |    4 |    5 |   6 |  7 | 8 | Totals
                 2 |  3 |  31 |    - |    - |    - |   - |  - | - |     34
                 3 | 34 | 176 |  300 |    - |    - |   - |  - | - |    510
                 4 | 37 | 226 |  885 | 1093 |    - |   - |  - | - |   2241
                 5 | 17 |  71 |  310 |  784 | 1040 |   - |  - | - |   2222
                 6 |  4 |  14 |   54 |  153 |  288 | 375 |  - | - |    888
                 7 |  - |   1 |    4 |    9 |   15 |  33 | 39 | - |    101
                 8 |  - |   - |    - |    1 |    - |   1 |  - | 2 |      4
            Totals | 95 | 519 | 1553 | 2040 | 1343 | 409 | 39 | 2 |   6000
TABLE XII. 
Length of Individual Pods. 
rs oe ee 
— —|—- Pie ae ua 
45 2] 55 | 13 | 66 141 75 | 2138 | 85 | 53 9% | — 
| 46 1 | 56 | 12 | 66 83 | 76 | 152 | 86 | 28 96 2 | 
| 47 1 | 57 | 19 | 67 , 104 | 77 | 122 | 87 | 26 97 3 | 
| 48 — | 58 | 19 | 68 | 128 | 78 | 117 | 88 | 18 98 3 
| 49 1] 59 | 13 | 69 56 | 79 | 53 | 89 3 99 ¢ 
| 50 1 | 60 | 54 | 70 | 222 | 80 | 187 | 90 | 21 | 100 3 | 
| 51 2] 61 | 39 | 71 | 113 | 81 s2./ 91 | 7 | 101 3 
52 3 | 62 | 54 | 72 | 162 | 82 86 | 92 | 11 | 102 1 
53 4 | 63 | 51 | 73 154 | 83 64 | 93 sim | — | 
54 5 | 64 | 63 | 74 | 158 | 84 | 56 | 94 104 2 | 
TABLE XIII. 
Total Lengths of 50 Pods for 60 Individual Cercis Trees. 
3345 | 3361 | 3587 | 3504 | 3499 | gsgs | 3730 | a4is | seas | 3202 
3679 | 3794 3352 | 3707 | 3615 | B463 | 3945 | 3388 | 4026 | 3642 
3930 | 4546 3654 | 3658 | 3823 | 3484 | 3806 | 3977 | 3244 | 3459 
3548 | 3748 3343 3686 | 3619 | 3978 | 3725 | 3544 | 3223 | 4086 
3693 | 3725 | 3168 | 3873 | 3889 | 3279 | 3502 | 3053 | 3765 | 4031 
3883 | 3850 3770 3921 | 3497 | 3791 | 3749 | 3603 | 4117 | 3590

ON THE PROBABLE ERROR OF A COEFFICIENT
OF CONTINGENCY WITHOUT APPROXIMATION.


By ANDREW W. YOUNG, M.A. AND KARL PEARSON, F.R.S.


(1) Introductory.


There have been two memoirs dealing with the probable error of a coefficient
of contingency, namely that by Blakeman and Pearson in 1906* and that by
Pearson in 1914†. In the former paper the authors started from the expression
for the mean square contingency

mn? 4 
al "*) 

and varied $n_{uv}$, $n_u$ and $n_v$, but neglected the squares and products of these
variations. The result was lengthy, and the arithmetical work laborious. In 1914
Pearson gave reasons for considering $n_u$ and $n_v$ as constant during the sampling
and got a much simpler value for $\sigma_{\phi^2}$. The result in actual numerical cases did
not differ widely from the much more elaborate formula of the earlier memoir.
Recent work in other directions has, however, shown that caution must be used
in neglecting the square and product terms of the variations due to random sampling,
and the object of the present paper is to consider the variation of $\phi^2$ on the hypothesis
of Pearson’s 1914 note but without approximation.

Let a population of size M be grouped into c divisions (for example, the cells
of a contingency table) and let the contents of the sth division be $m_s$. Let a
sample of size N be taken at random from the population and let $n_s$ be the contents
of the sth division according to the same grouping.

We shall here consider the variation of the quantity $\phi^2$ defined by

$$1 + \phi^2 = S\left(\frac{n_s^2}{N\lambda_s}\right) \tag{i}$$

where $\lambda_s$ is a number connected with the sth division and is for the present restricted
only by the condition

$$S(\lambda_s) = N \tag{ii}$$

a condition which enables us to write

$$\phi^2 = S\left\{\frac{(n_s - \lambda_s)^2}{N\lambda_s}\right\} \tag{iii}$$

as equivalent to (i).
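The equivalence of (i) and (iii) rests only on the sample counts summing to N and on condition (ii). It can be checked numerically with the short sketch below; the counts and weights are hypothetical values chosen so that both sum to N = 100, and the function names are introduced here for illustration.

```python
def phi2_from_i(n, lam):
    """phi^2 from definition (i): 1 + phi^2 = S(n_s^2 / (N * lambda_s))."""
    N = sum(n)
    return sum(ns ** 2 / (N * ls) for ns, ls in zip(n, lam)) - 1.0


def phi2_from_iii(n, lam):
    """phi^2 from form (iii): S((n_s - lambda_s)^2 / (N * lambda_s))."""
    N = sum(n)
    return sum((ns - ls) ** 2 / (N * ls) for ns, ls in zip(n, lam))


# Hypothetical sample counts n_s and weights lambda_s; both sum to N = 100.
n = [12, 30, 18, 40]
lam = [20.0, 25.0, 25.0, 30.0]

print(phi2_from_i(n, lam), phi2_from_iii(n, lam))
```

If the weights do not satisfy (ii), the two expressions differ by $S(\lambda_s)/N - 1$, which is why the restriction is imposed before (iii) is written down.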


* Biometrika, Vol. V. p. 191. † Biometrika, Vol. X. p. 570.

These undetermined numbers $\lambda_s$ are thus in general of the nature of weights and
may be chosen in a variety of ways. The most important particular case is that
of the population being grouped in a contingency table with, say, two variates.
The sth division will be, say, the cell (u, v) and if $\lambda_s$ be taken to be
$N \dfrac{m_u m_v}{M^2}$, $m_u$ and $m_v$ being as usual the marginal totals of the uth row
and vth column of the population M, $\phi^2$ will be the mean square contingency*.
Other cases will be discussed later.

The object of the present paper is to investigate the variation of the quantity
$\phi^2$ as determined from the samples of the population. We take the numbers $\lambda$
to be a property of the whole population and accordingly to have no variation as
long as the size N of the samples is constant. It is true that in most cases in practice
there will be only one sample and that the values of the numbers $\lambda$ will have to
be deduced from that sample and will therefore deviate from the values which
would be used if the sampled population were known. But what we are seeking
is the variability of the samples on the understanding that the distribution of
the whole population is definite although in practice we know only the approximation
to that distribution which is given by our sample. If we had wanted the
variability of the calculated values of $\phi^2$ deduced from a large number of random
samples, then we should have taken into account the variation of the $\lambda$’s as well
as of the n’s. In this lies the difference between the discussions in the two earlier
papers of 1906 and 1914.


This investigation follows that of the second paper, but we shall here give the
full expressions without approximation, i.e. without neglecting the square of $\delta n_s$
as was done in 1914. It will appear from the numerical examples worked out
later that this squared term makes a fairly great difference and, even if this were
not so, it is always preferable to have such formulae in full in order to decide the
legitimacy of neglecting any terms. This is especially the case in statistical theory
where neglect of the later terms of a Taylor expansion often leads to false results.


* Drapers’ Research Memoirs, Biometric Series, I. On the Theory of Contingency, etc., Cambridge
University Press, 1904; Biometrika, Vol. V. p. 191, 1906; Biometrika, Vol. X. p. 570, 1914.

(2) Mean Value of $\phi^2$.

Let $\bar{\phi}^2$ be the mean value of $\phi^2$ and let $\bar{n}_s$ be the mean value of $n_s$, i.e. the
value which would be given by taking a very large number of samples. Then we
can write

$$\frac{\bar{n}_s}{N} = \frac{m_s}{M}.$$

Also if we define $\delta\phi^2$ and $\delta n_s$ by the equations

$$\phi^2 = \bar{\phi}^2 + \delta\phi^2, \qquad n_s = \bar{n}_s + \delta n_s,$$

we have

$$1 + \bar{\phi}^2 + \delta\phi^2 = S\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + 2S\left(\frac{\bar{n}_s\,\delta n_s}{N\lambda_s}\right) + S\left(\frac{(\delta n_s)^2}{N\lambda_s}\right) \tag{iv}$$

Sum all such equations for a large number of samples and divide by the number
of samples. Then, since Mean $\delta n_s = 0$ and Mean $\delta\phi^2 = 0$, we have

$$1 + \bar{\phi}^2 = S\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + \text{Mean } S\left(\frac{(\delta n_s)^2}{N\lambda_s}\right) \tag{v}$$

The expression Mean $(\delta n_s)^2$ is typical of several which we have to use in what
follows and it will be useful to state or prove all the needful formulae before
proceeding further.


(3) Formulae regarding Products of Deviations. 


The deviations 5n, arrange themselves according to a hypergeometrical series 
and the moment coefficients of this series are known to be* 


Be = xi N pq 
Hs = X1X2N pq (p — 9) 
Ha = x1 N pq (3xsN pq + xa) 
N-1 
where m=1- 5 
2 (N —1) , duenete’ (vi), 





xs=(1 - 5) {1-3 Gaeta) 


“= 1-69 —5(1 wo) 





When M is very large as compared with N, as in the majority of cases in
practice, we may write $\chi_1 = \chi_2 = \chi_3 = \chi_4 = 1$.
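The role of the finite-population factors can be illustrated by comparing the exact second and third moments of a hypergeometric count with the closed forms $\chi_1 Npq$ and $\chi_1\chi_2 Npq(q-p)$. The sketch rests on assumptions stated here: $\chi_1 = 1 - (N-1)/(M-1)$ and $\chi_2 = 1 - 2(N-1)/(M-2)$ are the standard hypergeometric factors, p is taken as the share of the population in the division counted (so the skewness factor appears as q − p under this labelling), and the function name and numbers are hypothetical.

```python
from math import comb


def hypergeometric_central_moments(M, m, N):
    """Exact mean, 2nd and 3rd central moments of the count drawn from a
    division of size m when N items are sampled without replacement from M."""
    lo, hi = max(0, N - (M - m)), min(m, N)
    pmf = {k: comb(m, k) * comb(M - m, N - k) / comb(M, N) for k in range(lo, hi + 1)}
    mean = sum(k * w for k, w in pmf.items())
    mu2 = sum((k - mean) ** 2 * w for k, w in pmf.items())
    mu3 = sum((k - mean) ** 3 * w for k, w in pmf.items())
    return mean, mu2, mu3


# Hypothetical population: M = 50 items, m = 20 in the division, sample N = 10.
M, m, N = 50, 20, 10
p = m / M          # share of the population in the division counted
q = 1 - p
chi1 = 1 - (N - 1) / (M - 1)
chi2 = 1 - 2 * (N - 1) / (M - 2)

mean, mu2, mu3 = hypergeometric_central_moments(M, m, N)
print(mu2, chi1 * N * p * q)
print(mu3, chi1 * chi2 * N * p * q * (q - p))
```

As M grows with N fixed, both factors tend to unity and the moments approach those of the binomial, which is the simplification used later in the paper.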


We can now make use of these formulae to derive the following: 


(a) Mean $(\delta n_s)^2$. This is $\mu_2$ in the notation of (vi) and

$$\text{Mean } (\delta n_s)^2 = \chi_1\,\bar{n}_s\left(1 - \frac{\bar{n}_s}{N}\right) \tag{a}$$


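* Pearson, Phil. Mag. 1899, p. 239; Biometrika, Vol. V. p. 174.

(b) Mean $\delta n_s\,\delta n_{s'}$ where s and s' differ. Suppose first that $\delta n_s$ remains constant
and investigate the mean of $\delta n_{s'}$ for this constant value of $\delta n_s$. Now the distribution
of $n_{s'}$ for $n_s$ constant is clearly given by the sub-hypergeometrical series with
$N' = N - \bar{n}_s - \delta n_s$ as total population and

$$p' = \frac{\bar{n}_{s'}}{N - \bar{n}_s}, \qquad q' = 1 - \frac{\bar{n}_{s'}}{N - \bar{n}_s},$$

so that the Mean $\delta n_{s'}$ for $\delta n_s$ constant is

$$(N - \bar{n}_s - \delta n_s)\,\frac{\bar{n}_{s'}}{N - \bar{n}_s} - \bar{n}_{s'} = -\frac{\delta n_s\,\bar{n}_{s'}}{N - \bar{n}_s}.\;{}^*$$

Hence

$$
\begin{aligned}
\text{Mean } \delta n_s\,\delta n_{s'} &= -\text{Mean } \delta n_s \cdot \frac{\delta n_s\,\bar{n}_{s'}}{N - \bar{n}_s} \\
&= -\text{Mean } (\delta n_s)^2\,\frac{\bar{n}_{s'}}{N - \bar{n}_s} \\
&= -\chi_1\,\bar{n}_s\left(1 - \frac{\bar{n}_s}{N}\right)\frac{\bar{n}_{s'}}{N - \bar{n}_s}, \quad \text{from (a)} \\
&= -\chi_1\,\frac{\bar{n}_s\,\bar{n}_{s'}}{N}
\end{aligned}
\tag{b}
$$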
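Results (a) and (b) can be confirmed by direct enumeration for a small population. The sketch below lists every possible sample from a hypothetical three-division population and compares the exact variance and product-moment with the closed forms; the population sizes and the function name are assumptions made for illustration only.

```python
from itertools import product
from math import comb


def exact_var_and_cov(m, N):
    """Exact Mean (dn_0)^2 and Mean dn_0 dn_1 when N items are drawn without
    replacement from divisions of sizes m (multivariate hypergeometric)."""
    M = sum(m)
    nbar = [N * ms / M for ms in m]
    n_samples = comb(M, N)
    var0 = 0.0
    cov01 = 0.0
    for n in product(*(range(min(ms, N) + 1) for ms in m)):
        if sum(n) != N:
            continue
        ways = 1
        for ms, ns in zip(m, n):
            ways *= comb(ms, ns)
        prob = ways / n_samples
        var0 += prob * (n[0] - nbar[0]) ** 2
        cov01 += prob * (n[0] - nbar[0]) * (n[1] - nbar[1])
    return var0, cov01


# Hypothetical population of M = 12 in three divisions, sample of N = 4.
m = (5, 4, 3)
N = 4
M = sum(m)
chi1 = 1 - (N - 1) / (M - 1)
nbar = [N * ms / M for ms in m]

var0, cov01 = exact_var_and_cov(m, N)
formula_a = chi1 * nbar[0] * (1 - nbar[0] / N)
formula_b = -chi1 * nbar[0] * nbar[1] / N
print(var0, formula_a)
print(cov01, formula_b)
```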


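(c) Mean $(\delta n_s)^3$. Directly from (vi) and (vii)

$$\text{Mean } (\delta n_s)^3 = \chi_1\chi_2\,\bar{n}_s\left(1 - \frac{\bar{n}_s}{N}\right)\left(1 - \frac{2\bar{n}_s}{N}\right) \tag{c}$$

(d) Mean $(\delta n_s)^2\,\delta n_{s'}$. Using the process of double summation as in (b) we
have

$$
\begin{aligned}
\text{Mean } (\delta n_s)^2\,\delta n_{s'} &= -\text{Mean } (\delta n_s)^3\,\frac{\bar{n}_{s'}}{N - \bar{n}_s} \\
&= -\chi_1\chi_2\,\frac{\bar{n}_s\,\bar{n}_{s'}}{N - \bar{n}_s}\left(1 - \frac{\bar{n}_s}{N}\right)\left(1 - \frac{2\bar{n}_s}{N}\right), \quad \text{from (c)} \\
&= -\chi_1\chi_2\,\frac{\bar{n}_s\,\bar{n}_{s'}}{N}\left(1 - \frac{2\bar{n}_s}{N}\right)
\end{aligned}
\tag{d}
$$

(e) Mean $(\delta n_s)^4$. As in (c) we have immediately

$$\text{Mean } (\delta n_s)^4 = \chi_1\,\bar{n}_s\left(1 - \frac{\bar{n}_s}{N}\right)\left\{3\chi_3\,\bar{n}_s\left(1 - \frac{\bar{n}_s}{N}\right) + \chi_4\right\} \tag{e}$$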


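(f) Mean $(\delta n_s)^2 (\delta n_{s'})^2$. We again use the double summation as in (b), but in
this case the algebra is much more troublesome. From the constants of the
sub-hypergeometrical as given in (b) we have

$$\chi_1' N' p' q' = \left(1 - \frac{N - \bar{n}_s - \delta n_s - 1}{M - m_s - 1}\right)(N - \bar{n}_s - \delta n_s)\,\frac{\bar{n}_{s'}}{N - \bar{n}_s}\left(1 - \frac{\bar{n}_{s'}}{N - \bar{n}_s}\right).$$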

* Since $S(\delta n_s)$ must be zero, we can regard the Mean $\delta n_{s'}$ for a given $\delta n_s$ as being the result of a
distribution of a deviate $-\delta n_s$ over all the divisions except the sth. The portion due to the
s'th is then $\dfrac{\bar{n}_{s'}}{N - \bar{n}_s} \times (-\delta n_s)$, as obtained above.

But this is the Mean $(\delta' n_{s'})^2$, where $\delta' n_{s'}$ is measured from the mean of $n_{s'}$ in the
case where $\delta n_s$ is fixed and we must reduce to the general mean, i.e. where $\delta n_s$ is
not given, to obtain the mean value of $(\delta n_{s'})^2$ for constant $\delta n_s$. This is done by
adding the square of the difference between these means of $n_{s'}$, namely

$$\frac{\bar{n}_{s'}^2\,(\delta n_s)^2}{(N - \bar{n}_s)^2}.$$
We thus obtain that the Mean (8n,)? for constant dn, 
Fg Me ae = ji. ; 2 2 
a (1 _N—ii,—6n, ') (WN — 5, — ta) fig (1 is “) + ny? (Sn,) 











M -—-m,—1 N — it, N-i (v _ ay 
~ (a ¥)- (1) i — gin) wen a 
Lar ( i) ( ¥ M (NV “T, is) M? (N iol n,)* 


M 
ref aow 3. 
N (¥ (N — ii) 1) 
He? (6n,)? 
» (N — ii,)?’ 
when we substitute Mi,/N for m,. 


Thus we have to evaluate 


=f 2N N Mean (8n,)* 
Mean (5n,)*(5n,’) u(i- M) Mean (5n,)? — (i- M Dm M N-i, 





N2 Mean (8n,) 4 M?i, (N — ii, — fig) 1 
(N M a ora 
N?(y (N — ii.) — 1) 


~ M? (N — ii,)? 
Substituting from (a), (c) and (e), we find 


= (N ae aa 
Mean (8n,)? (8n,)? = x1 ote tu (1 : im) a amu (N — ii, — i,) 
N* (5 (W — i,) — 1) 
N 
. ( “) M (N — 2i,)(N —ii, — iiy) . 
i — %Xs 
M/s (a (N — ii,) — 1) N(x N —ii,) — 1) 


N — ii, — iy Ny i 


+ 3x3 -F + Xa =f 
(V-a)((N-ny-1) 0 SNM 


This expression must be symmetrical in s and s’ and this will be the case only 


Mean (8n,)*. 











— Xa 


——. oe : ; 
if the quantities V (N —i,)— 1 and N —i%, in the denominator cancel with 


factors in the numerator. By taking the two terms in x, together we get rid of 
the N — ii, factor and after a laborious expansion in substituting the values of the 
x’s we reduce the whole expression to the comparatively simple form 
i.e + ig Bile ) : 
: v(1i— 4 ia) did ( 
; I Seer all Raa (Tee ‘ 
X1 N 1x8 N N2 Xai Sf) 


This agrees with the value obtained by Isserlis from the differential equation to
the hypergeometric series and thus confirms his result obtained by a totally
different procedure*.


* Proc. Roy. Soc. Vol. XCII. p. 28. Our notations are different.


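With these formulae we can now proceed to discuss the mean value of $\phi^2$ and
its variability.


(4) Mean Value of $\phi^2$ (continued).

At the end of (2) we had arrived at the equation

$$1 + \bar{\phi}^2 = S\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + \text{Mean } S\left(\frac{(\delta n_s)^2}{N\lambda_s}\right),$$

and substituting from (a) we now obtain

$$1 + \bar{\phi}^2 = S\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + \chi_1\,\frac{1}{N}\,S\left\{\frac{\bar{n}_s}{\lambda_s}\left(1 - \frac{\bar{n}_s}{N}\right)\right\} \tag{viii}$$

We can usually put $\chi_1 = 1$ and write

$$1 + \bar{\phi}^2 = \left(1 - \frac{1}{N}\right) S\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + \frac{1}{N}\,S\left(\frac{\bar{n}_s}{\lambda_s}\right) \tag{ix}$$

In the particular case of the contingency table

$$1 + \bar{\phi}^2 = \left(1 - \frac{1}{N}\right) S\left(\frac{\bar{n}_{uv}^2}{\bar{n}_u \bar{n}_v}\right) + \frac{1}{N}\,S\left(\frac{N\bar{n}_{uv}}{\bar{n}_u \bar{n}_v}\right) \tag{x}$$

Now $S\left(\dfrac{\bar{n}_{uv}^2}{\bar{n}_u \bar{n}_v}\right) = 1 + \phi_0^2$, where $\phi_0^2$ is the mean square contingency for the whole
population, so that the mean value of $\phi^2$ as determined from a large number of
samples is in excess of the true mean square contingency by

$$\frac{1}{N}\left\{S\left(\frac{N\bar{n}_{uv}}{\bar{n}_u \bar{n}_v}\right) - (1 + \phi_0^2)\right\}.$$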
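Equation (ix) can be checked exactly for a small case by enumerating every possible sample. The sketch below assumes the large-population limit ($\chi_1 = 1$), so that the sample counts follow a multinomial distribution; the cell probabilities, the weights and the function name are hypothetical choices made for illustration.

```python
from itertools import product
from math import factorial, prod


def exact_mean_one_plus_phi2(p, lam, N):
    """Exact mean of S(n_s^2 / (N * lambda_s)) over all multinomial samples of size N."""
    total = 0.0
    for n in product(range(N + 1), repeat=len(p)):
        if sum(n) != N:
            continue
        coef = factorial(N)
        for k in n:
            coef //= factorial(k)          # multinomial coefficient, exact at each step
        prob = coef * prod(pk ** k for pk, k in zip(p, n))
        total += prob * sum(k * k / (N * ls) for k, ls in zip(n, lam))
    return total


# Hypothetical three-cell population and weights with S(lambda) = N.
N = 6
p = [0.5, 0.3, 0.2]
lam = [2.0, 2.5, 1.5]
nbar = [N * pk for pk in p]

exact = exact_mean_one_plus_phi2(p, lam, N)
formula_ix = ((1 - 1 / N) * sum(nb ** 2 / (N * ls) for nb, ls in zip(nbar, lam))
              + (1 / N) * sum(nb / ls for nb, ls in zip(nbar, lam)))
print(exact, formula_ix)
```

The second term of (ix) is what makes the mean of the sample values exceed the population value; it does not vanish even when the population itself shows no contingency.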


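(5) Standard Deviation of $\phi^2$. Non-approximative Formulae.

From the equations

$$1 + \bar{\phi}^2 + \delta\phi^2 = S\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + \frac{2}{N}\,S\left(\frac{\bar{n}_s\,\delta n_s}{\lambda_s}\right) + \frac{1}{N}\,S\left(\frac{(\delta n_s)^2}{\lambda_s}\right)$$

and

$$1 + \bar{\phi}^2 = S\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + \chi_1\,\frac{1}{N}\,S\left\{\frac{\bar{n}_s}{\lambda_s}\left(1 - \frac{\bar{n}_s}{N}\right)\right\},$$

we have

$$N\,\delta\phi^2 = S\left(\frac{(\delta n_s)^2}{\lambda_s}\right) + 2S\left(\frac{\bar{n}_s\,\delta n_s}{\lambda_s}\right) - \chi_1\,S\left\{\frac{\bar{n}_s}{\lambda_s}\left(1 - \frac{\bar{n}_s}{N}\right)\right\} \tag{xi}$$

Squaring, summing for a large number of samples and dividing by the number of
samples, we have

$$N^2\sigma^2_{\phi^2} = \text{Mean } N^2\,(\delta\phi^2)^2$$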
=n [f(a eae O-AY 
(4) 9 8) fh(0-9)}-40)] 
=a [5 (G3) a (ES) +a (8 


wags (CRE) «a5 8) +a (YL 


3 f\s' s \ A? 3f\s’ 
n,? (N iss ni,)* agit (N ii Tis) (N nz fiz) 
es(a ae )-etssGay we) 


where $S_{s'}$ denotes the same as $S_s$, i.e. summation for all values of s, and $SS_{ss'}$ denotes
summation for all values of s’ except s’ = s followed by summation for all values
of s*.


Substituting from the equations (a), (b), (c), (d), (e), (f), we have 
7n,(N — — ii 
AN) (5g, WO -1) + yh 





a ae 
N?20*4: = S \yaxs W V 











Ng n,(N Nis) (N 2n :) Ny n f 
Bo 45 i X1X2 — a es } a 4855 1 hy uw 6h Um! aad 
ni,” (N Tis)” TgNy (N Ms) (N fy) 
x°S iy Na " = x'S 8 hee roe “ee 


3" 





Rei [xa + ( 3 (8x ii 2 x—¥) ig + (4 — (6x3 + 12x, — 2x) y) n2 
7 (- Mu + (3x, + 8 1\-4 
N Xs X2 — X1) wi) Ns 
N,N NeNy N,+ Ns’ 
i ss {ih | (xs a xs) igtiy + (x1 — X3 — 2x) N,N G. + iy) 
4 
+ (-¥+ yt (8xs + 8x2 — x1) x) a, ale 


Now it is evident that in numerical work the double summation would involve 
much extra labour, but we can get rid of it by using the identities 


(s(;:)) - 8 (fa) + 98 (3551). 
3 (j:)'s (Rt) (Ra) + 88 GY") = 8 (FA) +458(KE +H), 


(s(X)) - $Ga) +88 Gar), 


and so reducing all to single summations. 
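The reduction depends on identities of the type $(S x_s)^2 = S x_s^2 + SS\,x_s x_{s'}$, where the double sum runs over all ordered pairs with s' different from s. A minimal numerical check, with hypothetical values and function names introduced here, is:

```python
def single_sum_of_squares(x):
    """S(x_s^2): one term for each division."""
    return sum(v * v for v in x)


def double_sum_of_products(x):
    """SS(x_s * x_s'): all ordered pairs (s, s') with s' different from s."""
    return sum(x[i] * x[j]
               for i in range(len(x))
               for j in range(len(x))
               if i != j)


# Hypothetical values for three divisions.
x = [3, 5, 7]
square_of_sum = sum(x) ** 2
print(square_of_sum, single_sum_of_squares(x) + double_sum_of_products(x))
```

In numerical work the left-hand side needs c terms while the double sum needs c(c − 1), which is the saving in labour the text refers to.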


after expansion and rearrangement. 


* As this notation may be somewhat unusual, it may be better to make it clear by taking a case
with three variates only, for example:

$$(S\,n_s)^2 = (n_1 + n_2 + n_3)^2 = n_1^2 + n_2^2 + n_3^2 + (n_2 + n_3)\,n_1 + (n_3 + n_1)\,n_2 + (n_1 + n_2)\,n_3 = S_s\,n_s^2 + SS_{ss'}\,n_s n_{s'}.$$

This leads to 
~ uP Xa\ 
N*o* = xixXa5 (y9) + x1 (Xs — x1 + 7) 7S 

ni,” 
( (x3) 
2x: - — 4x, a ii, 
+) 6 (5) 5 (3) 

_ (4x3 + 8x2) (35) 
(4 We) 8 (55 


4 : 1 7,*\)? ” 
+ X1 (- x + Bx + 8x2 x1) ya) {s(**)} «oe Kid). 


8 


(6) Standard Deviation of $\phi^2$. Approximate Formulae.


The result of the preceding section is an exact one since we have neglected no
terms in arriving at it, but as mentioned before we can usually take M to be very
large compared with N and make $\chi_1 = \chi_2 = \chi_3 = \chi_4 = 1$. With this simplification
equation (xii) becomes


som $ AGRE) - OR 
+ ya{6 ” (;: :) — (He) . () ahs {s (at les (ris) | 


+H Is (i ‘) + {s (x) _ 28 (3%) | EON AR (xiii). 


In the great majority of cases it will be impossible to make rigorous use of this
formula since we have no other knowledge of the whole population than what
is given by the sample. In particular the $\bar{n}$’s are usually unknown and we must
simply make use of the approximations at our disposal, namely the n’s of the
observed sample.

Again, it will usually happen that while 7, may be fairly large 7, — A, will be 
small and it will give formulae which are much more convenient for computation 
if we write #, = 7, — A, and substitute #, + A, for 7, in equation (xiii), remembering 
that S (A,) = N. 


After some reduction the formula becomes 


— $s 
iN i+ [98 (53) + bitte )8(f)-12 28 (Hi) + + 2—Ag)o—22—B6p*— 104] 


«LH 8) oa) oe msh oaea 





where c is the number of classes or categories in the population in question. 

(7) First Application. Contingency. 


As mentioned in (1), if we regard the sth division as the (u, v) cell of a contin- 
gency-table and if we take 


then 


and (Me - sr)’ 
Per cee 2 | 


— = Mean square contingency. 
Ny Ny 


Accordingly, with the notation 


/ op 2 {(r = were) 
8 
* ime bated mec te 
the equation (xiv) gives the standard deviation of the mean square contingency, 
when M is very large as compared with N. 

The terms enclosed in the first bracket of (xiv) are exactly those of Pearson’s
1914 paper in Biometrika, so that the second and third brackets contain the terms
arising from the squares and higher products of $\delta n_s$.

Of the total correction due to the higher approximation it is of interest to find
how much is due to the change of mean and consequently the change of origin of
$\delta\phi^2$, when the square of $\delta n_s$ is not neglected. The true mean is given by


= i n2 1 n N. 
ee pes pe 8 ees 
1+ =H S(H) + 754K (1 x} 


and using the observed values of $n_s$ as the best approximation available for $\bar{n}_s$, 


1+ = (1449 (1-9) + 7 8(“™) 


= (149%) + 5 184) 42+ e—h, 


so that the difference between the true mean and the approximate mean obtained
by neglecting squares of $\delta n_s$ is


#— $= 5 18(%) gt + 0-1}. 


In accordance, then, with the formula for change of second moment with change
of origin we get the effect of the change of mean on $\sigma^2_{\phi^2}$ by subtracting


Wi {s (*") —¢*?+c— 1} 


from the approximate value.

In the examples given below it will be seen that this is only a small part of the
total correction and thus the main part of the correction is due to the retention
of the squares and products in the value of $(\delta\phi^2)^2$ used in (5).


(8) Numerical Illustrations. 


I. Contingency between Handwriting and Intelligence in Girls. 


The probable errors of the contingency constants in this table have been worked
out both in the 1906 and in the 1914 papers and below is given a table showing the
effect of the corrective terms of the present discussion.


The new summations required are found to be 


- .. or 
S (i) = 12-788, S (x) == 15982, 
(pe? é Nyps\ _ nn. 
S (Fs) = 93 144, S ( “4 = 57270, 


and in the following equation the numerical values of the various terms of equa- 
tion (xiv) are given in the same order as their corresponding algebraic terms: 


\[ \sigma^2_{\phi^2} = \frac{4}{1801}\,[\,.14865 + .09580 - .00918\,] \]
\[ \quad + \frac{1}{(1801)^2}\,[\,558.864 + 97.404 - 1.784 + 58.205 - 22 - 5.365 - .092\,] \]
\[ \quad + \frac{1}{(1801)^3}\,[\,57270 + 15982 + 164 - 186 + 870 + 1224\,]. \]


The other calculations are summarised in the table below: 


TABLE I.

\[ \phi^2 = .09580, \qquad C = \sqrt{\frac{\phi^2}{1 + \phi^2}} = .2957. \]

                                        Various formulae used

                          Blakeman and     1st Term     1st and 2nd       All Terms
                          Pearson (1905)   of (xiv)     Terms of (xiv)    of (xiv)

σ_φ² ...                  .02023           .02286       .02709            .02729
Probable error of φ²      .01365*          .01542       .01827            .01841
σ_C ...                   .0285            .03219       .03815            .03844
Probable error of C       .0192            .02171       .02573            .02593




In this table the work has been carried out to four significant figures with a view 
to showing the corrective effects of the various terms. It is apparent that the 
fineness of approximation given by (xiv) in full is more than is required in practice, 


* Incorrectly given as -0042 in Blakeman and Pearson’s paper, loc. cit. footnote p. 196. 
but there is a considerable difference between the values given according as the
second term is used or not and it seems that in some cases it would be advisable
to calculate this term or at least the most important terms in it, viz.

\[ \frac{1}{N^2}\left\{ 6\,S\!\left(\frac{\psi_s^2}{\lambda_s^2}\right) + 8\,S\!\left(\frac{\psi_s}{\lambda_s}\right) + 2c \right\} \qquad \text{(xv)}. \]

Using this approximation in the above case, we obtain $\sigma_{\phi^2} = .02730$, a result
which, by a mere chance of course, is almost exactly that given by the full
expression in (xiv).

II. Contingency between the Hair-colours of Pairs of Female Cousins. 


In the example just given the total number N in the sample was fairly large, 
viz. 1801, and it might be expected that in smaller samples the corrective terms 
would be of increased importance. With a view to testing this a contingency- 
table given by Miss Elderton in her Memoir on “The Measure of the Resemblance 
of First Cousins”* was selected. There are 36 cells in this table and the total 
number in the sample is only 218, there being several cells with zero or very small 
content.

The Table is given in full on p. 226 along with the quantities required for the
calculation of $\phi^2$ and $\sigma_{\phi^2}$; it is there evident from the figures how large a proportion
of the variability depends on the cells of small content. This is of course to be
expected but the importance of having large numbers in all the cells is not always
appreciated. In this particular case physiological reasons would prevent us
from clubbing together the “Fairs” and “Reds” and with the fewness of the
observations at our disposal we must use the table simply as it stands.

The scheme followed in each cell of the table is shown in the last column,
and in the marginal totals are given the values of all the summations required
for (xiv). These are








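\[ \phi^2 = S\!\left(\frac{\psi_s^2}{N\lambda_s}\right) = .14895, \qquad S\!\left(\frac{N\psi_s}{\lambda_s^2}\right) = -2336.8, \]
\[ S\!\left(\frac{\psi_s}{\lambda_s}\right) = -1.8481, \qquad S\!\left(\frac{\psi_s^2}{\lambda_s^2}\right) = 19.162, \]
\[ S\!\left(\frac{N}{\lambda_s}\right) = 5170.9, \qquad S\!\left(\frac{\psi_s^3}{N\lambda_s^2}\right) = .08277. \]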


When these are substituted in equation (xiv), we have, preserving the algebraic 
order as before, 


\[ \sigma^2_{\phi^2} = \frac{4}{218}\,[\,.08277 + .14895 - .02218\,] \]
\[ \quad + \frac{1}{(218)^2}\,[\,114.9714 - 13.6837 - .9932 + 50.5515 - 22 - 8.3411 - .2219\,] \]
\[ \quad + \frac{1}{(218)^3}\,[\,-2336.8 + 5170.9 + 3.4 - 38.3 + 125.7 + 1224\,], \]


and the whole work is again summarised in Table III (p. 227): 
* Eugenics Laboratory Memoirs, 1v. Cambridge University Press 1907. 




























Second Female Cousin 





Brown 





Light 
| Brown 


Fair 





Red 


Very 
Dark 


— 00298 
11 
9-495 

22-96 





1-0000 
TABLE II. Contingency between the Hair-colours of Female Cousins
First Female Cousin 

a Brown -_ Fair | Red Totals 

11 3°5 11 15 CO i) 36 =Sn, 
9-248 6-523 9-495 3-963 | 0-826 _ (As) 

23-57 33-42 22-96 55-00 263-93 435-55 = (N/A,) 
‘1894 | --4634 +1585 —-6214 | ~ 1-0000 -1-2230 =S(W,/A,) | 
-0359 +2147 -0251 -3861 1-0000 19259 =S(~,2/A,?) | 

4-46 ~ 15-49 3-64 — 34-18 — 263-93 — 286-66 =8 (Ny,/d,2) 
00152 00642 -00109 -00702 -00379 02705 =S (W,2/N2,) | 
-00029 -- 00298 -00017 ~ -00436 ~-00379 ~-00697  =S (W,3/N),2)| 
13 12 10 9 1 56 

14-385 10-147 14-771 6°165 1-284 —_ 

15°16 21-48 14-76 35-36 169-78 280-11 

— -0963 -1826 — +3230 -4598 — +2212 -1913 
0093 -0333 -1043 2114 0489 4431 

~ 1-46 3-92 -~4:77 16-26 — 37-56 ~19°15 
-00061 00155 -00707 -00598 -00029 -01702 

~ -00006 00028 |  —-00228 00275 | ~ -00006 .00092 
12 8 9°75 3°25 3 39°5 

10-147 7-157 10-419 4-349 0-906 _- 

21-48 30:45 20-92 50-12 240-61 397-00 
-1826 +1178 ~ -0642 — +2527 | 2-3111 1-8312 
-0333 -0139 0041 -0639 53412 56711 

3-92 3-59 - 1-34 12-67 | 556-07 534-08 
-00155 00046 -00020 -00127 | -02220 -03210 
00028 -00005 - -00001 — 00032 | 05131 04833 

10 9°75 16°5 9°25 1 57°5 

14-771 10-419 15-166 6-330 1-319 — 

14:76 20-92 14-37 34-44 165-29 272-74 

— +3230 — -0642 0880 -4614 — +2419 0788 
-1043 0041 -0077 -2129 -0585 -4126 

4:77 ~ 1-34 1-26 15-89 ° — 39-98 — 25-30 
00707 -00020 00054 -00618 00035 01543 

~ 00228 - 00001 | -00005 00285 — -00009 00069 
9 3°25 9°25 1 i) 24 

6-165 4-349 6-330 2-642 0-550 — 

35:36 50-12 34-44 82-51 396-38 653-81 
-4598 — +2527 -4614 — 6215 — 1-0000 — 1-5744 
2114 | -0639 +2129 -3863 1-0000 2-2606 

16:26 | -12-67 15-89 —51-28 — 396-38 — 462 36 
00598 00127 -00618 00468 -00252 ‘02766 
00275 — 00032 00285 — 00291 — 00252 — 00452 
1 3 1 ri) 0 5 
1-284 0-906 1-319 0-550 0-115 _— 

169-78 240-61 165-29 396-38 1895-66 3131-65 

— +2212 23111 ~ +2419 ~1-0000 | — 1-0000 — 1-1520 
0489 5-3412 -0585 1:0000 | 1-0000 8-4486 

| ~37-56 556-07 ~ 39-98 -396-38 | - 1895-66 — 2077-44 
-00029 -02220 -00035 -00252 | -00053 -02968 
~ -00006 05131 ~ -00009 — 00252 — 00053 04431 
Marginal Totals are the same as for vertical margin. φ² = .14895


















































S(N/λ_s) = 5170.9;  S(ψ_s/λ_s) = −1.8481;  S(ψ_s²/λ_s²) = 19.162;  S(Nψ_s/λ_s²) = −2336.8;  S(ψ_s³/Nλ_s²) = .08277.
TABLE III.

\[ \phi^2 = .14895, \qquad C = \sqrt{\frac{\phi^2}{1 + \phi^2}} = .36005. \]

                                   Various formulae used

                          1st Term of       1st and 2nd Terms    All Terms of
                          equation (xiv)    of equation (xiv)    equation (xiv)

σ_φ² ...                  .062006           .079848              .082317
Probable error of φ²      .041822           .053857              .055522
σ_C ...                   .16902            .21765               .22438
Probable error of C       .11400            .14680               .15134




The relative importance of the three terms of equation (xiv) is not markedly 
different in this table of small total content from what it was in the case of the 
Handwriting-Intelligence table and we cannot base different conclusions on the 
two cases. 


Again, using the approximation given by selecting the large terms from the
second bracket of (xiv), viz.

\[ 6\,S\!\left(\frac{\psi_s^2}{\lambda_s^2}\right) + 8\,S\!\left(\frac{\psi_s}{\lambda_s}\right) + 2c, \]

we obtain $\sigma_{\phi^2} = .0864$, which as in the previous example is a reasonable
approximation to the result of the full expression.
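The shortened calculation can be sketched in a few lines. The summation values are those listed for the cousins table; the variable names are hypothetical, and c = 36 is the number of cells stated in the text.

```python
import math

N, c = 218, 36          # sample size and number of cells

# Summations quoted for the cousins table (names are illustrative only).
S_cubic = 0.08277       # sum of psi^3 / (N * lambda^2)
phi2 = 0.14895          # mean square contingency
S_square = 19.162       # sum of psi^2 / lambda^2
S_linear = -1.8481      # sum of psi / lambda

# First bracket of (xiv), kept in full.
first_term = 4.0 / N * (S_cubic + phi2 - phi2**2)

# Second bracket reduced to its three largest terms.
second_term_approx = (6.0 * S_square + 8.0 * S_linear + 2.0 * c) / N**2

sigma_approx = math.sqrt(first_term + second_term_approx)
print(f"approximate sigma of phi^2 = {sigma_approx:.4f}")
```

The approximate value lies a little above the full three-term result, because the neglected terms of the second bracket are on balance negative in this example.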


(9) Second Application. Test for Zero Contingency. 


Suppose that we may expect in the sampled population an absence of contin- 
gency or correlation between the variates considered. In other words we will 
suppose 





¢? =S8 (ns — ii,) 


which is the mean square contingency in the case of a population with zero 
correlation, would certainly not vanish. The problem then arises: How great 
may the quantity ¢? be without making it highly improbable that the sample 
in question is really a sample from a population of uncorrelated material ? 


First of all, the mean value of ¢? as determined from a large number of samples 
would be 


COP Peete eee eee seseeeeeeseeeese 
as is obtained by substitution of $\lambda_s = \bar{n}_s$ in (viii), or, if $\chi = 1$,

\[ \overline{\phi^2} = \frac{1}{N}\,(c - 1). \]
In the same way we derive from equation (xiii) 


c(c— 


ot = he (gt oe) + alt — D- rle— DF... 


where H is the harmonic mean of the mean cell contents, or for the usual particular 
- case when M is very large as compared with N 


, je . ee ee 
o742 = w2 1F + —" and. J 2 (ce — } sictbcevedurenenee (xviii), 
an expression which is very easily calculated especially as i will usually be small 


compared with ¢ and hence a good rough approximation for a fairly large table 
will be got from 


Thus if we take twice the standard deviation as a limit to the probability of a 
deviation being that of a random sample, we have as a rough upper limit to the 
value, which 4? may be expected to take in any sample, 


Zo-nery 


(10) Numerical Illustration. 
In the example of the Contingency-table for Handwriting and Intelligence
in Girls

\[ \overline{\phi^2} = \frac{1}{N}\,(c - 1) = .01943, \]

and when calculated from the more exact formula (xviii)

\[ \sigma_{\phi^2} = .004879, \]

the approximation given by (xix) being .0046.
Hence in accordance with our assertion above, we should regard any observed
value of $\phi^2$ which exceeds .01943 + 2 × .00488, i.e. .02919 or, say, .03, as being
incompatible with zero contingency. The observed value of $\phi^2$ = .0958.

The corresponding mean value of C, the coefficient of contingency, is

\[ \sqrt{\frac{.01943}{1.01943}} = .13806, \]

and the upper limit for C according to our assertion is .1684 or, say, .17. The
observed value of C is .2957. Clearly there is definite association between
intelligence and handwriting.
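The arithmetic of this test is easily retraced. In the sketch below N = 1801 is the sample size given earlier, c = 36 is inferred from the quoted mean (c − 1)/N = .01943, and the standard deviation .004879 is simply taken from the text rather than recomputed from (xviii); the function and variable names are illustrative only.

```python
import math

N, c = 1801, 36                 # c inferred from (c - 1)/N = .01943
sigma_phi2 = 0.004879           # quoted value from the more exact formula
observed_phi2 = 0.0958          # observed mean square contingency


def contingency_coefficient(phi2):
    """Coefficient of contingency C = sqrt(phi^2 / (1 + phi^2))."""
    return math.sqrt(phi2 / (1.0 + phi2))


mean_phi2 = (c - 1) / N
upper_phi2 = mean_phi2 + 2.0 * sigma_phi2   # mean plus twice the s.d.

mean_C = contingency_coefficient(mean_phi2)
upper_C = contingency_coefficient(upper_phi2)
observed_C = contingency_coefficient(observed_phi2)

incompatible_with_zero = observed_phi2 > upper_phi2
print(mean_phi2, upper_phi2, mean_C, upper_C, observed_C, incompatible_with_zero)
```

The observed value exceeds the limit by a wide margin, which is the basis of the conclusion drawn in the text.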
(11) Summary of Formulae.


It will be convenient for purposes of reference to have all the formulae collected 
into one section. 


General Formulae. 
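For $\phi^2$ defined by

\[ \phi^2 = S\left\{ \frac{(n_s - \lambda_s)^2}{N\lambda_s} \right\} \qquad \text{(A)}, \]

where N is the number in a sample, $n_s$ is the number in the sth division of that
sample and $\lambda_s$ is a number connected with the sth division satisfying the condition
$S(\lambda_s) = S(n_s) = N$, we have proved that for an “infinite” sampled population

\[ 1 + \overline{\phi^2} = \left(1 - \frac{1}{N}\right) S\!\left(\frac{\bar{n}_s^2}{N\lambda_s}\right) + \frac{1}{N}\, S\!\left(\frac{\bar{n}_s}{\lambda_s}\right) \qquad \text{(B)*}, \]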





and : 
c= t (AG) +04 
+ ;| 6s (¥5) + (8 — 44%) (¥: ) ne 128 (Fe )+ +(2—46%)e—22—5642— 109¢| 





+ LoCSE) +98) {a(R} -a9(fs)+20-95(R)+ee—2] 


eh oeanecesns , 
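where c is the number of divisions or cells and $\psi_s = \bar{n}_s - \lambda_s$, or, with a fair amount
of approximation,

\[ \sigma^2_{\phi^2} = \frac{4}{N}\left[ S\!\left(\frac{\psi_s^3}{N\lambda_s^2}\right) + \phi^2 - \phi^4 \right] + \frac{1}{N^2}\left[ 6\,S\!\left(\frac{\psi_s^2}{\lambda_s^2}\right) + 8\,S\!\left(\frac{\psi_s}{\lambda_s}\right) + 2c \right] \qquad \text{(D)}. \]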
Contingency. 


In the case of a contingency table the sth division may be taken to be the
(u, v) cell and $\lambda_s = n_u n_v / N$, where $n_u$ and $n_v$ are the marginal totals of the uth row and
the vth column. Formulae (A), (B), (C), (D) are then directly applicable.
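As a small illustration of this prescription, the mean square contingency of a two-way table may be computed cell by cell, each cell's independence value being the product of its marginal totals divided by N. This is a sketch only; the function name and the two tiny tables are invented for the example and are not data from the paper.

```python
import math


def mean_square_contingency(table):
    """phi^2 = sum over cells of (n - lam)^2 / (N * lam), lam = row * col / N."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    phi2 = 0.0
    for u, row in enumerate(table):
        for v, n in enumerate(row):
            lam = row_totals[u] * col_totals[v] / total
            phi2 += (n - lam) ** 2 / (total * lam)
    return phi2


# Hypothetical tables: one exactly independent, one perfectly associated.
independent = [[10, 20], [30, 60]]
associated = [[10, 0], [0, 10]]

phi2_independent = mean_square_contingency(independent)
phi2_associated = mean_square_contingency(associated)
C_associated = math.sqrt(phi2_associated / (1.0 + phi2_associated))
print(phi2_independent, phi2_associated, C_associated)
```

For the independent table every cell equals its marginal product and the measure vanishes; for the 2 × 2 diagonal table it reaches unity, giving a coefficient of contingency of about .707.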


Test for zero contingency. 
When there is zero contingency in the total population

\[ \lambda_s = \frac{\bar{n}_u \bar{n}_v}{N} = \bar{n}_s \]

and (B) reduces to

\[ \overline{\phi^2} = \frac{1}{N}\,(c - 1) \qquad \text{(B′)}, \]


* As usual the bar over a letter denotes “the mean value of.” It is to be noted that usually there
is only one sample and the value of $n_s$ in that sample has to be taken as $\bar{n}_s$.
and (C) to
 , ofe— 3) 


1 ' 
O42 = N? i at. N aa 2 (c _ 0} eadcersionece 4odest eae (C a 
where c is the number of cells in the table and H is the harmonic mean of the cell 
contents. 
Rough approximations in the case of zero contingency are given by 


* c 


ea 
¢ = 
and oy =H. 


and from these we can derive as a rough upper limit to the value of ¢? given by a 
random sample from a population of zero contingency 
9 V2c_c¢ ( —") 


N WN We 


c 


N 


+ 























ON SOME NOVEL PROPERTIES OF PARTIAL AND 
MULTIPLE CORRELATION COEFFICIENTS IN A 
UNIVERSE OF MANIFOLD CHARACTERISTICS. 


By KARL PEARSON, F.R.S. 


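(1) Let the universe consist of N individuals each the bearer of n characteristics
symbolised by the numbers 1, 2, 3, ... s, s′, s″, ... n respectively. Let $r_{ss'}$ be the
correlation coefficient of the s and s′ characteristics, and let the whole system of
total correlations be provided by the determinant Δ, where

\[ \Delta = \begin{vmatrix} 1, & r_{12}, & \dots & r_{1s}, & \dots & r_{1n} \\ r_{21}, & 1, & \dots & r_{2s}, & \dots & r_{2n} \\ \dots & \dots & \dots & \dots & \dots & \dots \\ r_{s1}, & r_{s2}, & \dots & 1, & \dots & r_{sn} \\ \dots & \dots & \dots & \dots & \dots & \dots \\ r_{n1}, & r_{n2}, & \dots & r_{ns}, & \dots & 1 \end{vmatrix} \qquad \text{(i)}. \]

This determinant being symmetrical because $r_{ss'} = r_{s's}$.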

We shall use $\Delta_{ss'}$ for the first minor corresponding to the constituent in the
sth row and s′th column, and $\Delta_{ss's''s'''}$ for the second minor corresponding to the
first minor of $\Delta_{ss'}$ which is associated with the constituent in the s″th row and
s‴th column.

Then if $R_{s\cdot 123\dots(s)\dots n}$ denote the multiple correlation coefficient of the sth
characteristic on the other n − 1 characteristics, i.e. the n without s, and
${}_{ss'}\rho_{123\dots(s)\dots(s')\dots n}$ denote the partial correlation coefficient of the sth and s′th
characteristics for the remaining n − 2 characteristics constant, the following
results are fundamental and well-known:

\[ R^2_{s\cdot 123\dots(s)\dots n} = 1 - \frac{\Delta}{\Delta_{ss}} \qquad \text{(ii)}, \]

\[ {}_{ss'}\rho_{123\dots(s)\dots(s')\dots n} = -\frac{\Delta_{ss'}}{\sqrt{\Delta_{ss}\,\Delta_{s's'}}} \qquad \text{(iii)}. \]

To abbreviate the subscripts we shall write these

\[ R_{s\cdot 1-n} \quad \text{and} \quad {}_{ss'}\rho_{1-n}, \]

but where others of the variates are to be left out of account we shall be obliged
to introduce the bracket system to mark partial or multiple correlation coefficients
of lower orders. Thus $R_{s\cdot 123\dots(s)\dots(s'')\dots(s''')\dots n}$ or simply $R_{s\cdot 1-(s''s''')-n}$ would signify
the multiple correlation coefficient of the sth characteristic for n − 3 other
characteristics, s″, s‴ and of course s being excluded. Similarly ${}_{ss'}\rho_{1-(s''s'''s'''')-n}$ would
be the partial correlation of the sth and s′th characteristics for n − 5 other
characteristics supposed constant, i.e. the n variates with the exception of
s, s′, s″, s‴ and s⁗.
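The determinantal definitions (ii) and (iii) translate directly into a short computation. The sketch below assumes that "first minor" is to be read as the signed minor (cofactor), which is what makes (iii) agree with the usual three-variate formula; the helper names and the equicorrelated test matrix are invented for the illustration, and indices are counted from 0.

```python
import math


def det(m):
    """Determinant by expansion along the first row (adequate for small n)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        sub = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(sub)
    return total


def signed_minor(m, i, j):
    """Signed first minor (cofactor) of the constituent in row i, column j."""
    sub = [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]
    return (-1) ** (i + j) * det(sub)


def multiple_r_squared(r, s):
    """Equation (ii): R^2 of variate s on all the others."""
    return 1.0 - det(r) / signed_minor(r, s, s)


def partial_rho(r, s, t):
    """Equation (iii): partial correlation of s and t, all others constant."""
    return -signed_minor(r, s, t) / math.sqrt(
        signed_minor(r, s, s) * signed_minor(r, t, t))


# Hypothetical system: three variates, every total correlation equal to 0.5.
r = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]

R2_first = multiple_r_squared(r, 0)
rho_01 = partial_rho(r, 0, 1)
print(R2_first, rho_01)
```

For this matrix both quantities come out as one third, in agreement with the elementary formula (r12 − r13 r23)/√((1 − r13²)(1 − r23²)) for the partial coefficient.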
(2) Preliminary Propositions. To express First Minors in Terms of 
Second Minors. 


While values of the second minors in terms of the first have long been known, 
i.e. the equations of the form 


Ass Ave — Aw A,” 
Te eee (v), 


the reversed results, i.e. the expressions for first minors in terms of second, seem 
to me less familiar, at any rate I have not come across them in the literature with 
which I am acquainted. 


The forms to be demonstrated are 


A? 
Bas = go (Anes Messe” — Magy?) evcssecesseessssseesssssees (vi), 
ss’s 
A? 
Ber = — (Aveere Mawar + Ayres Myrr'eg)eososeessssseeen (vii), 
ss's 
where Wate | Bigs Beats Bagh | xacxsiveeseosiveccveeest (viii), 
| 
| Ays, Avs; Ays 
| A, 8) Ay's ? A, 8 
and Moe Ms “Ras hha + onc (ix). 
- Ag's'ss's Age's; aac Agse’s” 


at Ags's’s; = Aseq's”, Agss's’ 
To prove these results I start from (iv) and (v) to express the value of 


Agas's’ Asses TF Ages" Ag's'ss- 


We find 
J a! ye _f8 au vee iji—_ , a” 
Agae's Ness's” av Agsss’ Ag's'ss — ae . ; = " ‘ (Ounie Se ae 


erties — Berle) . Carrls — teed 
Bp oa sen AME. Mons i aR, Ri. TTB an 
A A 
Ags" 
= Ar (A, Ag, Ass" vo Ass A*y in Ay,y A’, = Ayy A*,y + ZA gy AgsAgy) 
Ay’ ° Vss's” 
A? ‘ 


which gives us at once (vil). 
Similarly we have by (iv) and (v)
Ages's Asez"s” = A? 55's” 
= A, Agy = A*,y x Ass Ags” ae A, at (A,,Ayy” Ez 
A A A 
= 4 (Ags Avy Ayy — A,, A*y,” Far Ags Ay, ar Ayy A*,, + 2Aye Ay, Aye) 


A,, A,,”)* 
2 





which gives at once the result (vi). 
We shall now substitute values like (vi) and (vii) in our definition (viii) of 


Vs” and so deduce (ix). 


We have 
Vo As Pm) ( Agss's Ag's’'ss’ ) ( Asya” Av's'ss” ) 
sss 7 
(Vi54")® a A? 6s'e” ; - Ases's Ags's’s ; = Asss’s” Ag's's's F 
( Asses Age's’ ) a ( Aysres” Asss's” ) 
+ Age's Avs's's ; wai A*yas"s ‘ + Ayass” Ag's'ss 
f Nese Ava'ss” Agss’s” Ases's” Ayry’ssAg's's's 
+ duties” eee’ ine 








but the determinant on the right is the square of the determinant 
Ages’s's — Agts’ss'> — Ass's’s rE 
ae Ag's'ss's Ag's'ss; ie Auss's” 
ei Agy's’ss as Asses" Mazes’ | 


whence by taking the square root we have the result (ix). 
(3) Application to Multiple and Partial Correlation Coefficients. 


Equation (viii) may be put into the form 

















ey ee or ae oe, oe | 
VA, Ags Aya Ayre” 
Ass 1 Ave | 
VAs Ave : V Bye Mery" | 
Ass" . __ Ager 1 | 
VA, Age WAgAyy ; 
= —— A* 1, —ss'Pi-n» —ss'Pi—n 
(1 as R?,..-n) (1 - R?,,, a) (1 = R?,, =a — ssP1—-2> 1, — yyPi—n 
1 | 


—ss"Pi-n» — s's"Pi-—n> 


jteseouenoee (x). 
Turning to the expression (ix) we may write it





a Ag's'ss — Ave | 
s's‘s''s 

V2 065" = AS Agyes Ag's'ss A see's 1, Ws SESS UA ? VA aes ‘ee | 
Ages's Av's'ss Ayys's' Agas's | 
| nt Age's : 1 — Asse” _ | 

/ ee + 

| Ayss's" Age's VAy's 8s Asss's | 
| = Agss's but Ages's” 1 
| a > ————"— » 
| V Ages's’ Ass’ V Ayres Agss's’ 


The determinant on the right can be expressed in terms of partial correlation 
coefficients of one lower order, i.e. it equals 


1, ss'P1-(s")—n>» —ss"P1—(s')—n 
ss'P1—(s")—n» A, s's"P1—(s)—n 
ss"P1—(s')—n> 's"P1—(s)—n> 1 


Further: 
MD Aggy e's As's'ss Aass's = A‘.— 





° _ 6 — 
Also: a hf, in ‘a aa 
a. - 1 = 
(1 ai R*,. 1—n) (1 os R*,y sn) (1 “> R?y, 1-n) 
1 
x =— — ———— —. 
- (1 — RB, -1)—-n) (1 — By. 1-ty—n) (1 — By. 1-We1—n) 
us: 
(1 7 R?,. 5-1¢'1-n) (1 rr" By. 2-2) (1 haat R?y, 1-Ws)—n) 


= (1 ‘ata R?,. 1-(s)—n) (1 = R?, 1-W"—n) (1 iv: RB? 1-1 -n) “Sees (xi). 
This is a relation between sets of three multiple correlation coefficients of the 
(n — 2)th order. , 
1 


Clearly V2 O63” = A’, —_—_—_ Se 
E (1 . R*, 1-n) (1 — R,, 1-n) (1 ar R*, . 1-n) 
] 





“@-F l- Bo -e 
x 1, pe ee ey BSP ee eR ES. (xii). 
ss'P1—(s")—n> 1, s's"P1—(s)—n | 
ssPi—(s')—n> s's"P1-(s)—n; 1 | 
Squaring (x) and combining with (xii) we find 
| 1, — ss'Pi-n» — ss"Pi-n 
— ss'‘Pi-n> 1, — s's'Pi-n 
| — ss"Pi-n: — s‘s"Pi—n> 1 “ 
(1 — R%,.1-n) (1 — Ry. 1-0) (1 — By. 1-2) 
1, ss'P1—(s")—n» — ss"P1—(s'))—n 
| ss'P1—(s")—n> i, s's"P1-(s)—n 
- | e'Pi-ii-n» __¥e"P1—Wwi—a 1 ws 
“a-F,..00- oe 
The denominator on the right-hand side of (xiii) can be replaced by the left-hand side
of (xi), or by the square root of the product of both sides of (xi). This is a relation
between partial correlation coefficients of the (n − 2)th order with multiple
coefficients of the (n − 1)th order and partial correlation coefficients of the (n − 3)th
order with multiple coefficients of the (n − 2)th order. In the particular case of
three variates 1, 2, 3 with total correlations $r_{12}$, $r_{23}$, $r_{31}$ (xiii) reduces to

\[ \begin{vmatrix} 1, & -{}_{12}\rho_3, & -{}_{13}\rho_2 \\ -{}_{21}\rho_3, & 1, & -{}_{23}\rho_1 \\ -{}_{31}\rho_2, & -{}_{32}\rho_1, & 1 \end{vmatrix}^2 = \frac{(1 - R^2_{1\cdot 23})(1 - R^2_{2\cdot 31})(1 - R^2_{3\cdot 12})}{(1 - r^2_{23})(1 - r^2_{31})(1 - r^2_{12})} \begin{vmatrix} 1, & r_{12}, & r_{13} \\ r_{21}, & 1, & r_{23} \\ r_{31}, & r_{32}, & 1 \end{vmatrix} \qquad \text{(xiv)}, \]

a result which might possibly be of practical value in testing the accuracy of
the determination of the first order partial and multiple correlation coefficients.
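A numerical check of the three-variate relation may be sketched as follows. The total correlations chosen are arbitrary illustrative values, the helper names are invented, and the partial and multiple coefficients are obtained from the ordinary three-variate formulae rather than from any routine of the paper.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 array written out in full."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def partial(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b for c constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))


# Hypothetical total correlations.
r12, r13, r23 = 0.6, 0.3, 0.4

delta = det3([[1, r12, r13], [r12, 1, r23], [r13, r23, 1]])

p12_3 = partial(r12, r13, r23)
p13_2 = partial(r13, r12, r23)
p23_1 = partial(r23, r12, r13)

# 1 - R^2 for each variate on the other two, i.e. Delta over the leading minor.
one_minus_R2 = [delta / (1 - r23**2), delta / (1 - r13**2), delta / (1 - r12**2)]

lhs = det3([[1, -p12_3, -p13_2],
            [-p12_3, 1, -p23_1],
            [-p13_2, -p23_1, 1]]) ** 2
rhs = (one_minus_R2[0] * one_minus_R2[1] * one_minus_R2[2]
       / ((1 - r23**2) * (1 - r13**2) * (1 - r12**2))) * delta

print(lhs, rhs)
```

The two sides agree to rounding error, so a discrepancy in hand computation would point to a slip in one of the partial or multiple coefficients.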


I now return to equations (vi) and (vii), and I divide (vii) by the square-root
of the product of $\Delta_{s's'}$ and $\Delta_{s''s''}$ obtained by cyclical interchange from (vi).
We find





Aver Aare Anever + AvitesAnreres 
VAys Ags V Agyers Ags'ss ei AP yye's V Ag's'ss Ayes as A? y5"ss 
Assz's” Ags'ss’ Ay'e'ss’ 


woe oe ——— = 
= V Ages Ag's’ss V Aves Agss's” V Age's Ag's's's 
= == ? 
2 
if 1 —Arease if y— — Aversa 
Ags'ss Age's’ Ag's'ss Ag's'ss 


or, changing sign throughout,

\[ {}_{s's''}\rho_{1-n} = \frac{{}_{s's''}\rho_{1-(s)-n} - {}_{ss'}\rho_{1-(s'')-n}\;{}_{ss''}\rho_{1-(s')-n}}{\sqrt{1 - {}_{ss'}\rho^2_{1-(s'')-n}}\;\sqrt{1 - {}_{ss''}\rho^2_{1-(s')-n}}} \qquad \text{(xv)}. \]

(xv) is the familiar result for obtaining a partial correlation coefficient of the
(n − 2)th order from those of the (n − 3)th order, but the proofs usually given of
it seem to be based on some appeal to general analogy rather than to the definite
algebraical form of the coefficients concerned. It was, indeed, a lecture proof of
the relation (xv) from the determinantal forms of the coefficients which led me to
the results (vi) and (vii), as apparently novel determinantal relations.
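The agreement of the repetitional formula with the determinantal definition can be illustrated for four variates. In this sketch the correlation values are arbitrary, the helper names are hypothetical, signed minors are assumed in the determinantal form, and variates are numbered from 0; the second-order partial of variates 0 and 1 is found first from the determinant and then from first-order partials in which variate 2 alone is held constant.

```python
import math


def det(m):
    """Determinant by expansion along the first row (adequate for small n)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, a in enumerate(m[0]))


def signed_minor(m, i, j):
    sub = [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]
    return (-1) ** (i + j) * det(sub)


def partial_by_determinant(r, s, t):
    """Partial correlation of s and t with every other variate constant."""
    return -signed_minor(r, s, t) / math.sqrt(
        signed_minor(r, s, s) * signed_minor(r, t, t))


def first_order(r_ab, r_ac, r_bc):
    """Partial correlation of a and b with the single variate c constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))


# Hypothetical total correlations of four variates 0, 1, 2, 3.
r = [[1.0, 0.5, 0.4, 0.3],
     [0.5, 1.0, 0.6, 0.2],
     [0.4, 0.6, 1.0, 0.1],
     [0.3, 0.2, 0.1, 1.0]]

direct = partial_by_determinant(r, 0, 1)

# Lower-order partials, variate 2 constant, variate 3 still to be eliminated.
p01_2 = first_order(r[0][1], r[0][2], r[1][2])
p03_2 = first_order(r[0][3], r[0][2], r[3][2])
p13_2 = first_order(r[1][3], r[1][2], r[3][2])

recursive = (p01_2 - p03_2 * p13_2) / math.sqrt((1 - p03_2**2) * (1 - p13_2**2))
print(direct, recursive)
```

Both routes give the same value, which is the sense in which the repetitional process may be used to verify results obtained from (iii).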





We can next consider results (vi) and (vii) individually. From (vi) we find 


Ass A Agy Ags Aass's’ Asses” ( A? 69'4” ) 
= as’ Asss's’ Asss"s” (4 __ ‘ : 


A Vee A A Aye Aya A 


s3s's' Asss"s” 


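or, converting into partials and multiples, and writing

\[ P_{n-2} = \begin{vmatrix} 1, & -{}_{ss'}\rho_{1-n}, & -{}_{ss''}\rho_{1-n} \\ -{}_{ss'}\rho_{1-n}, & 1, & -{}_{s's''}\rho_{1-n} \\ -{}_{ss''}\rho_{1-n}, & -{}_{s's''}\rho_{1-n}, & 1 \end{vmatrix}, \]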
we have
] Us (1 << R?, 1-n) (1 _— R*,. 1-2) (1 valle R?» sn) icsats 1 : 
Pe ‘Pas (1 — Hy, ,_,) (1 — Ry. 1.) 
1 
a J ce x 1 — yyp*s_i—n)- 
F R?,~6)—n) (1 — R2,. We) —n) ( P*1-19)-n) 
Hence 


\[ (1 - R^2_{s\cdot 1-n})^2 = \frac{(1 - R^2_{s\cdot 1-(s')-n})\,(1 - R^2_{s\cdot 1-(s'')-n})\;P_{n-2}}{1 - {}_{s's''}\rho^2_{1-(s)-n}} \qquad \text{(xvi)}. \]

This is an expression for a multiple coefficient of the (n − 1)th order in terms of
multiple coefficients of the (n − 2)th order and partial coefficients of the (n − 2)th
and (n − 3)th orders.
Equation (xvi) just obtained may be put into another form by aid of the
relations*

\[ 1 - {}_{ss'}\rho^2_{1-n} = \frac{1 - R^2_{s\cdot 1-n}}{1 - R^2_{s\cdot 1-(s')-n}}, \qquad 1 - {}_{ss''}\rho^2_{1-n} = \frac{1 - R^2_{s\cdot 1-n}}{1 - R^2_{s\cdot 1-(s'')-n}}, \]

leading to

\[ (1 - {}_{ss'}\rho^2_{1-n})(1 - {}_{ss''}\rho^2_{1-n}) = \frac{(1 - R^2_{s\cdot 1-n})^2}{(1 - R^2_{s\cdot 1-(s')-n})(1 - R^2_{s\cdot 1-(s'')-n})} = \frac{P_{n-2}}{1 - {}_{s's''}\rho^2_{1-(s)-n}}, \text{ by (xvi)}. \]

Hence

\[ 1 - {}_{s's''}\rho^2_{1-(s)-n} = \frac{P_{n-2}}{(1 - {}_{ss'}\rho^2_{1-n})(1 - {}_{ss''}\rho^2_{1-n})} \qquad \text{(xvi)}^{bis}. \]

Thus (xvi)bis is the reverse of (xv), giving a partial correlation of the (n − 3)th
order in terms of those of the (n − 2)th order. For example, if there be three
variates, 1, 2, 3,

\[ 1 - r^2_{23} = \frac{1 - {}_{23}\rho_1^2 - {}_{13}\rho_2^2 - {}_{12}\rho_3^2 - 2\,{}_{23}\rho_1 \cdot {}_{13}\rho_2 \cdot {}_{12}\rho_3}{(1 - {}_{12}\rho_3^2)(1 - {}_{13}\rho_2^2)}, \]

leading to

\[ r_{23} = \frac{{}_{23}\rho_1 + {}_{12}\rho_3 \cdot {}_{13}\rho_2}{\sqrt{1 - {}_{12}\rho_3^2}\;\sqrt{1 - {}_{13}\rho_2^2}}, \]

which can be easily verified by substitution of the values of the partial correlations,
or be seen at once from the polar triangle.

For the particular case of three variates we may use (xvi) instead of (xvi)bis,
writing it

\[ (1 - R^2_{1\cdot 23})^2 = \frac{(1 - r^2_{12})(1 - r^2_{13})}{1 - r^2_{23}} \begin{vmatrix} 1, & -{}_{12}\rho_3, & -{}_{13}\rho_2 \\ -{}_{12}\rho_3, & 1, & -{}_{23}\rho_1 \\ -{}_{13}\rho_2, & -{}_{23}\rho_1, & 1 \end{vmatrix} \qquad \text{(xvii)}, \]

which will be found on substitution of the values of the partial coefficients on the
right and the multiple on the left to reduce to the familiar result that the square
of a 3 by 3 determinant is equal to the determinant formed by its minors. In this
case of three variates, if $r_{23}$, $r_{31}$, $r_{12}$ be taken as the cosines of the sides of a spherical
* R. S. Proc. A, Vol. xct. p. 496. 
triangle, then ${}_{23}\rho_1$, ${}_{31}\rho_2$, ${}_{12}\rho_3$ are the cosines of the angles, and $R_{1\cdot 23}$, $R_{2\cdot 31}$, $R_{3\cdot 12}$
are the cosines of the perpendiculars from the angles on the opposite sides. If
a, b, c are the sides, A, B, C the opposite angles, $p_a$, $p_b$, $p_c$ the perpendiculars on
the sides, then the above relation is

\[ \sin^4 p_a = \frac{\sin^2 b \, \sin^2 c}{\sin^2 a} \begin{vmatrix} 1, & -\cos C, & -\cos B \\ -\cos C, & 1, & -\cos A \\ -\cos B, & -\cos A, & 1 \end{vmatrix}. \]

It is greatly to be desired that the “trigonometry” of higher dimensioned plane
space should be fully worked out, for all our relations between multiple correlation
and partial correlation coefficients of n variates are properties of the “angles,”
“edges” and “perpendiculars” of sphero-polyhedra in multiple space. It would
be a fine task for an adequately equipped pure mathematician to write a treatise
on “spherical polyhedrometry”; he need not fear that his results would be without
practical application for they embrace the whole range of problems from anatomy
to medicine and from medicine to sociology and ultimately to the doctrine of
evolution.


Lastly we may turn to (vii), and express it also in multiple and partial correla- 
tion coefficients. We have 























Avs" AP Ages’s —_ osx Avs ae 
Vieehee  Voes’ Wie 
x ( ~ Sete “ poy x dene) 
V AysesAe's'ss Vi Avradverr VOrradvres!’ 
or using the correlation symbols 
_ (1 — Ran) (L— Ryan) (1 — Ryan) 1 
Py Pas V (= By io=n) 0 — B8a-09-n) 


1 sa l 


” Sl ~ Ws wad — Maga — Be — 


x (s's"P1-(s)—n eis ss"P1—-(s)—-n ss'P1-(s")—n): 








Or, 
3's"Pi—n — (1 gee R*, 1-2) 


x of (1 me R*,y 1-n) (1 We R*y 1») ales 
(1 — B,.1—wy—n) (Ll — R¥,.1-19"y—n) (1 — By .-te—n) (L — 2? 


x s's"P1—(s)—n — 83"P1—(s)—n X ss'P1-(s")—n 
e 








es1—te"—n) 





n-2 
This is a complicated form and unlikely to be of material service. 
If we use the symbol ‘P,,_, to represent the determinant 
1, ss'P1-(s")—n> —s8"P1-(s')—n | » 
3s'P1-(s")—n> 1, s's"P1—(s)—n 


ss"Pi-(s)—n» = o's"P1-(s)—n» 1 
we can throw (xviii) into the form


sz = ' ae = ae: ake Se 
ede ae | es PE 


s's"P1—(s)—n — s8"P1—(s)—n X gs'P1—(s")—n ( 
Alani ip liao raat 
Here on the right we have only partial correlations of the order n − 3, but the
expression involves a multiple of the order n − 1 as well as those of order n − 2.
Of course in the second radical s″ and s′ may be interchanged owing to the existence
of (xi). On the whole it appears best to confine our attention to (xv) and (xvi)
as the fittest representatives of (vi) and (vii) expressed in correlation forms. (xv)
has long been in use either to determine the partial correlation coefficients of
higher orders by repetitional processes, or better to verify results given by (iii).
The calculation of the higher multiple correlation coefficients by (ii) can be verified
by the aid of (xvi) if the partials have also been found. But if once the value of
Δ has been determined the continuous product formula may often be advantageously
used. Looked at from the determinantal standpoint this may be reached
as follows:


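\[ \Delta = \frac{\Delta}{\Delta_{ss}} \times \frac{\Delta_{ss}}{\Delta_{sss's'}} \times \frac{\Delta_{sss's'}}{\Delta_{sss's's''s''}} \times \frac{\Delta_{sss's's''s''}}{\Delta_{sss's's''s''s'''s'''}} \times \dots \text{ to } n - 1 \text{ terms} \]

\[ = (1 - R^2_{s\cdot 1-n})(1 - R^2_{s'\cdot 1-(s)-n})(1 - R^2_{s''\cdot 1-(ss')-n})(1 - R^2_{s'''\cdot 1-(ss's'')-n}) \dots (1 - r^2_{pq}), \]

if the pth and qth variates be the last to be excluded.
Hence*

\[ 1 - R^2_{s\cdot 1-n} = \frac{\Delta}{(1 - R^2_{s'\cdot 1-(s)-n})(1 - R^2_{s''\cdot 1-(ss')-n})(1 - R^2_{s'''\cdot 1-(ss's'')-n}) \dots (1 - r^2_{pq})} \qquad \text{(xix)}. \]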

The trouble of this method is that A, has to be calculated at each stage, if 
we deduce R,.,_, by a repetitional process. If we use it merely as a verification 
process for (ii) we shall not verify A,, unless it is worked out twice. Of course 
A,,, if n be at all large, is more troublesome to calculate than the 3 x 3 determinants 
of formula (xvi). 
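The continuous product may be illustrated by computing each multiple correlation independently of the determinant, here from the regression (normal) equations, and comparing the product of the factors 1 − R² with Δ itself. This is a sketch under stated assumptions: the four-variate matrix is hypothetical, the helper names are invented, and variates are excluded in the order 0, 1, 2.

```python
def det(m):
    """Determinant by expansion along the first row (adequate for small n)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, a in enumerate(m[0]))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            f = aug[k][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[k][j] -= f * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


def r_squared_of_first(r):
    """R^2 of the first variate on the rest, from the regression equations."""
    rest = [row[1:] for row in r[1:]]
    cross = r[0][1:]
    beta = solve(rest, cross)
    return sum(b * c for b, c in zip(beta, cross))


# Hypothetical total correlations of four variates.
r = [[1.0, 0.5, 0.4, 0.3],
     [0.5, 1.0, 0.6, 0.2],
     [0.4, 0.6, 1.0, 0.1],
     [0.3, 0.2, 0.1, 1.0]]

product = 1.0
m = r
while len(m) >= 2:
    product *= 1.0 - r_squared_of_first(m)   # one factor per excluded variate
    m = [row[1:] for row in m[1:]]           # drop that variate and continue

delta = det(r)
print(product, delta)
```

The last factor of the loop is 1 − r² for the final pair, so that n − 1 factors are multiplied in all, and their product reproduces the determinant.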





However the primary object of the present paper is not so much to provide 
verification formulae as to show by direct determinantal analysis certain relations 
known and unknown between the higher multiple and partial correlation coefficients. 


* This must not be confused with the formula 

: 1 — R?,.,-» i _ 
ql — gs’'P*1-(s’)—n) (1- 3’? —(a's"”)—n) (1 = 5/4 -(e's'"e"”) ~») eee (1 —1* pq) : 
see Yule, R. S. Proc. A, Vol. Lxx1x. p. 189, and Pearson, R. S. Proc. A, Vol. xct. p. 49°. 


1- as'P*12..-n= 

















ON THE APPLICATION OF “GOODNESS OF FIT” TABLES 
TO TEST REGRESSION CURVES AND THEORETICAL 
CURVES USED TO DESCRIBE OBSERVATIONAL OR 
EXPERIMENTAL DATA. 


By KARL PEARSON, F.R.S. 


Let us suppose that a sample of size N with class groups $n_{qp}$ is taken out of an
indefinitely large population of size M with class groups $\nu_{qp}$, these classes being
arranged according to two variates x and y. Then the mean of any array of x’s
for a given range of y variates connoted by the centre, $y_p$, of this (usually small)
range will be

\[ m_p = \frac{S(n_{qp}\, x_q)}{n_p} \qquad \text{(i)}. \]

Here $n_{qp}$, the number in the $x_q$, $y_p$ class, and $n_p$, the total number in the pth array
of x’s, will vary from sample to sample. But $x_q$ and $y_p$ will remain of course the
same. Now let $\bar{m}_p$ be the mean value of $m_p$ found from a large number A of
samples and let us measure $m_p = \bar{m}_p + \delta m_p$ from $\bar{m}_p$ and $n_{qp}$ from $\bar{n}_{qp} = N\nu_{qp}/M$,
and $n_p$ from $N\nu_p/M$, or take $n_{qp} = \bar{n}_{qp} + \delta n_{qp}$, and $n_p = \bar{n}_p + \delta n_p$. Here the
differentials are statistical differences and do not at present denote that we are
going to neglect their higher powers. From (i) we have


m.+odm,=S ("2") \, ce dn, + (): ae ()'+ 
’ eee i ES lee ji jy e 


+8 (2) i e.. (°*)" - (fy + _ 
Ny Ny Ny Ny 


Now we shall sum this for all A samples (dividing by A) and suppose that third 


order powers and products of Stee and , are negligible as compared with lower 
Dp + 
order powers and products*. If 2 denote a summation for all A samples 
% (6m,) _ Z (Om) _ Z On) _ 
.- , "2° 


since all these quantities are measured from their mean values. Thus we find 


n (5n,)* x (872,,5%,)\ 
fi,= 8 (“2") (1 + oi) — ( ( ~ *-)) ; 
An, 


2 
9 \Np 


* Actually terms of the third order also vanish. I have not investigated whether this be true for 
* terms of the fourth and higher orders. 
nae ts _
But * 2 Our in (1 - ®) _s SS Ps,) the ( a *) Be a (ii), 





























and accordingly 





I 





mean of the array in the sampled population. 


I 


Thus to a high order of approximation at least the mean of the array means is 
the mean of the corresponding array in the sampled population. This result 
cannot be taken as obvious, as the size of the array in the sample varies f. 


We can now write to a second approximation : 


ten” nt) (1 7 ‘er > ebne (, 2 Bie) 


> D Ny 





_ 8 Bran (a — ™,)} (1 at re) 


Dp D 





oe S {5Nqp (%q = My)} or S {5nNqpdNy (%q in My)} 
Ny Ti,” 


since dn, = S (8n,,). 


We shall now find the mean value of (8m,)? as far as third order terms. Let O, 
and Q; give the second and third order terms; let & denote a summation for all 
samples and A their number; write %, for z,— m,. Then 


LES (5p)? 

A ii,” 

18 {2 (8%)? 7} ma 2 S, {a (8%q',9%q"p) Lg Ey} 

A a? : A jig? ° 
2 

But x Se —s (1 e a) 


x (81%¢°58Nq"y) = n 
- = — 


= 








Thus 0, = ia, a ) ae S (7° q%q") seek 2S; (Rep Me’n HaHa") 


i? i.2 m2 
lp n,?N n,?N 
, - ~ 9 
a i Ny 1 {* Cael 


7 Ni i 


Dp 


D 


* Biometrika, Vol. 1x. p. 2. 
+ The assumption that the mean value of a character in a number of samples is the value in the 
sampled population is often made, but is nearly as often erroneous. Thus the mean value of the 
correlation of a character in samples is not the correlation of these characters in the sampled population. 
Hence to this order we have om, = o;,,/ Vii, precisely the value of the standard
deviation of the mean on the hypothesis that the array is of constant size equal 








to the mean size 7,. We next turn to the third order terms 0,. Here 
gee: 2 {3 (8%p%,) X see ta 
i n,* 
7 SX (Gn2,,5n,F7) SZ (5NgpdtNq"pSNp LyX") 
--2[* Me ee An,* ‘ 


But the summations marked by = can be expressed by aid of (e) and (f) on 
p. 244 below. We have 


- = 2iNqp Np 
0; det i,3 [8 fie» (1 a Wr) ( — Hf) ‘a 


~ 2,(1 — %) [Soi _ 2p (Si 
Rig? N iy W 


2 es eae 

= ii, 7(1 - y) , since S (fi,»%,) = 0 
According] 2 =@, 0. = O*niy A i 2 ( eke ™)t (vi) 
ely on, = Oe + Op = SPN — = (1— FB) p ceesceereenees . 


The corrective term O, is, I think, of marked importance for it indicates that the 
standard deviation of the mean of an array in a sample is not the same thing as 
the standard deviation of an array of constant size. It is not therefore legitimate 
to assume, as some authors have done, that the standard deviation of the mean of 
an array is given by the same formula, i.e. o7, Vig , as for an array of constant 
size equal to the mean number in that array for a large number of samples. 


Had we included the terms of the next order in (vi) we should have obtained 
in the curled brackets terms of the order 1/7,?. For a small array this might 
be equally important with the term already given in 1/N. Hence some caution 
must be exercised in retaining that term and dropping terms in 1/7,?; at the 
same time for the larger arrays 7,/N may be commensurable with unity. 


We may throw (vi) into the form 





2 
og My = 


G. 
mM, = 
Vi Ty 
to a first approximation, i.e. when we may neglect 2/N as compared with unity 
and 2 as compared with the mean number in the array. To assume without proof 
* “On the General Theory of Skew Correlation,” p. 14. Drapers’ Company Research Memoirs,
Cambridge University Press.
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the above value to be true is not legitimate as there is no @ prior’ reason why 
the standard deviation of the mean of an array of variable size should take, even to 
a first approximation, the value of the standard deviation of the mean of arrays 
of constant size, that constant size being the reduced frequency (i.e. 7, = v,N/M) 
of the sampled population. Actually a better approximation to o,, appears to 
be obtained if we add two units to the mean frequency-—a correction which may 
be of considerable importance for small arrays. 
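The correction just described can be sketched numerically. The following is a minimal illustration, not part of the memoir: the function names are hypothetical, and it assumes the corrected array size is $\bar n_p(1 - 2/N) + 2$, as used in the worked illustrations later in the paper.

```python
import math


def effective_array_size(n_bar: float, total: float) -> float:
    """Corrected array size n_bar * (1 - 2/N) + 2 (names are illustrative)."""
    return n_bar * (1.0 - 2.0 / total) + 2.0


def sd_of_array_mean(sigma_array: float, n_bar: float, total: float,
                     corrected: bool = True) -> float:
    """Standard deviation of an array mean.

    With corrected=False this is the constant-size formula sigma / sqrt(n_bar);
    with corrected=True the corrected array size is used instead.
    """
    size = effective_array_size(n_bar, total) if corrected else n_bar
    return sigma_array / math.sqrt(size)


# An array of 9 in a sample of 124 (the first array of Illustration I).
print(effective_array_size(9, 124))
print(sd_of_array_mean(4.0, 9, 124, corrected=False))
print(sd_of_array_mean(4.0, 9, 124, corrected=True))
```

For small arrays the two values differ noticeably; for an array of several hundred the correction is negligible.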

We next turn to the correlation of means and we have to evaluate the product
of two expressions like (iii) for $\delta m_p$ and $\delta m_{p'}$ where $p$ is not equal to $p'$. We shall
obtain, summing for A samples and dividing by A, the value of $\sigma_{m_p}\sigma_{m_{p'}}R_{m_p m_{p'}}$, where
$R_{m_p m_{p'}}$ is the correlation of the means. We shall again treat the square and cubic
order terms separately:


Square order terms in $\sigma_{m_p}\sigma_{m_{p'}}R_{m_p m_{p'}}$


= * (gis ] Fete dni [> Grete 
Dp 








Aint, ARy hy 
L (Snyp'SN,Ly)) | MyMy X (5n,5n,,) 
— m,S Vat - X 
iyNy Ny Ny! 
es S (TignNg'p'LaXq’) +m S (Tgp Nya) 
a oe ae 
Nii, iy Niigiy 
S(Tigy'NpLy) MyMy TzNy 
+ My —~Waie — —N : 
NpNy’ NpyNy! 
_ _ MpMy | MyMy | MypMy _ MyMy 
N N N N 
= 0. 
Thus we see that as far as square order terms are concerned:

$$R_{m_p m_{p'}} = 0 \qquad \text{(vii)},$$

notwithstanding that there exists a correlation between the numbers in the two
arrays on which the means are based. So far this result is only true to a first
approximation*. But we now turn to the cubic terms given by the product of the
square terms in $\delta m_p$ with the linear terms in $\delta m_{p'}$. There results the four separate
terms:

ZS (8Mqy'Xq) S (SNqpdNyXq)} . m,S {X (6n,,5n,5nXq)} 





An, 7, An, 7, 
My S{X (8Nqy (5ny)?) Z} — MyMy =X {Sn (8n,)*} 
Any? Ry AR, Ty 


I have evaluated each of these terms in succession by the use of formulae similar
to those on p. 244, and all these terms give the same result, i.e. their sum with
proper signs to its constituents


te “re(] 2 +) - "ere (1 _ 2My\ _ MyMy () _ Big) | MyMy 2ny\ _ 9 
Nii, N Nii, N Ni, N Ni, \ N/] ~ 


* I have given this result in memoir just cited, p. 13. 
Interchanging $p$ and $p'$ we obtain precisely the same result for all the four terms
in the product of the square terms in $\delta m_{p'}$ with the linear terms in $\delta m_p$. Thus
the cubic terms in the values of $\sigma_{m_p}\sigma_{m_{p'}}R_{m_p m_{p'}}$ are also zero. We have thus
established to a high order of approximation that (i) the mean value of $m_p$ = mean of array in sampled
population, (ii) that there is no correlation between the means in any two different
arrays, while (iii) to a lower order of approximation only $\sigma_{m_p} = \sigma_{\bar n_p}/\sqrt{\bar n_p}$.


Now we know that the distribution of means of samples of constant size taken 
from a population following any law of frequency approximates very rapidly as 
the size of the sample increases to the normal law. How far may we extend this 
result to the present case of the means of arrays the total frequency of which 
varies from sample to sample? 


With the view of considering the approach to a normal distribution, let us
investigate the third moment coefficient of $m_p$ for the samples of the $p$th array.
From (iii) keeping only lowest order terms we have

$$\delta m_p = \frac{S(\delta n_{qp}\,x_q)}{\bar n_p},$$


and accordingly . 


Z (5m,)>_ 1 fy 2 ort, . fata 
X _ A 


+68 . Prabha etter 














or, using (a), (b) and (c) on p. 244, 


y }— Nap 2Nqp ~ 
d i, Is ley (1 _ 7) (1 — N ) zl 


1 3 m ae 
= — | S (ign X,3) — x5 S (NgyH%q?) S (Ngp%) + ae {S (Nap | - 
. | ap"'@ N ap*@ apa N2 apa 
The last two terms vanish with the factor $S(\bar n_{qp}x_q)$. Thus

$$\frac{\Sigma(\delta m_p)^3}{A} = \frac{1}{\bar n_p^2}\,{}_p\mu_3,$$

where ${}_p\mu_3$ is the third moment coefficient of the array about its own mean in the
population sampled. We have seen also that to the same degree of approximation
$\sigma^2_{m_p} = {}_p\mu_2/\bar n_p$, where ${}_p\mu_2 = \sigma^2_{\bar n_p}$. Accordingly if ${}_pB_1$ be the value of the first
$\beta$-coefficient for the distribution of the means of the $p$th array in samples:

$${}_pB_1 = \frac{{}_p\beta_1}{\bar n_p},$$

where ${}_p\beta_1$ is the first $\beta$-coefficient of the $p$th array in the sampled population.
Hence whatever be the nature of the frequency distribution in that population,
if the number in the array be not too small, the distribution of the means of arrays
will approach symmetry, i.e. ${}_pB_1$ will be small*.


The following values have been used in the previous investigations, some of
which have and some of which have not, as far as I am aware, yet been published†.
The remainder will be required in investigating ${}_pB_2$, or the value of $\Sigma(\delta m_p)^4/A$.


Third Order Mean Products for Random Samples in the case of an indefinitely large
Sampled Population.











(b) Z {BMqy)? SM¢p} _ fest (1 ae), 

(c) ZX (SMgpdNypSMq"o) _ Bheoeetive 

(e) ne ~— i (a a 7) : 

(h) ~ ene _0, 

(i) = Coit) “i fee (1 Z *s) (a, — i,) 

(3) ea ..< *(1-%). 

() afm ienerl — Bee Mc, — mi) — ort 
(i) ae ‘apd _ = (2g — Tiiy) (: 7 ae) ; 


1,’ 
apap — —_ 
a {%, — My + Ly — M,}. 
Dp 


(m) >» sii eli _ _ Mgnt 


* Even if the array diverged considerably from the Gaussian value, i.e. ${}_p\beta_1 = .2$, say, then ${}_pB_1$ for
an array of even 10 only would be but .02 and the asymmetry very slight.

† (a) and (g), Phil. Trans., Vol. 186, A, p. 347 (Pearson, 1894); (b) and (c), Biometrika, Vol. IX. p. 95
(Soper, 1913); (d)-(m) probably here for the first time.
The following fourth order mean products are deduced from the more general
values given by Isserlis*: 





(n) >» (8 qp5Ng'pdNq’y5Nq”’y) =: (1 Lk ¥) n plg'pNg'pNq'"y 
A 





I 


(0) » {(8%qp)? 8N9'pdNq"y} = (1 oe x) Nap Ny'pNg"y (1 az a) 
A > 








N N 

(p) E {(3Mep)® Stty'y} = — fetes {3 (1 é x) Tiep (1 xe “e) + if, 

@) {Gre}? (Srqy)"} _ Fest» (1 _ (1 is e S Aye nm Shoat 2) + xt 
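while

$$\text{(r)}\qquad \frac{\Sigma(\delta n_{qp})^4}{A} = \bar n_{qp}\left(1 - \frac{\bar n_{qp}}{N}\right)\left\{3\left(1 - \frac{2}{N}\right)\bar n_{qp}\left(1 - \frac{\bar n_{qp}}{N}\right) + 1\right\}$$

is due to Pearson†.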


I shall now use (n) to (r) to determine $\Sigma(\delta m_p)^4/A$ to the first order terms.


= 2 [s (ESaeR) 4. a5 (2 Gnel* imate) 


Re 


~ 


+ 6S (= (5% p)* Coal ere" + 128 (= (nl Dar De alee Te ) 
A 








+ 245 (“Bodresdnerdnetbe Fee 
2 : | 


Substituting from (n)-(q) above we obtain the following result: 


. ‘ eigen 
: pe) 2)" a (8 fp (1 " x) (s (1 — 5) Fes (1 — 7) + 1) a} 
— 48 {" sone (3 (1 x) Tiep (1 = 7) + 1) a} 
2 — a. a 
+ 6S jie lap Ng'p ((1 as w( => V sg W sae i) Gs n) #3%,4| 
— 128 {fsePeen (1-5) (1-5) i Pigie| 
n 


9 
+ 248 {Pe d on ae. (1 = y) iit 


* R. S. Proc. Vol. 92, A, pp. 28-29 (1915).
† Phil. Trans. Vol. 186, A, p. 347 (1894).
I now rearrange this into terms not involving the factor $1 - \dfrac{2}{N}$ and those
involving it. The former are:


1 es NapNq'p sat 
2 [s{re(1 - 7) 8 a} — 4s {Pe W a 


= Gal Gent! 4) + 3 {8 (Figp%.2)}2/N — 48 (Figp%,2) X S (Map q)/N] 


Sip 
= is | et WN ? be co 


since $S(\bar n_{qp}x_q) = 0$. Here ${}_p\mu_s$ is the $s$th moment coefficient of the $p$th array about
the mean. I now take the terms involving the factor $1 - \dfrac{2}{N}$. They give


1 2 a Tap Top —_ 22 n No'p Nap ~ 3~ 
za(2 =e wv) | 38 {i (1 W a ne)e ‘t 128 Megas (1 = Me) ati} 


a 6S ws ( _ a ae = _ “Rept ah 


3 Siti ity} 
+ 105 flabrBeaiees 521] 
= Fi (1 oh x) 1s (Fig %_2)}? — 2S (Rgp%_*) ssn a3” _8 Bais a)} . 


where the equivalence is most easily verified by expanding the latter expression 
and ticking off the corresponding terms in the previous one. 








— 128 | "Re 


Since $S(\bar n_{qp}x_q) = 0$, we are left with the single term

$$\frac{3}{\bar n_p^2}\left(1 - \frac{2}{N}\right){}_p\mu_2^{\,2}.$$


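Thus we deduce combining our two sets of terms:

$$\frac{\Sigma(\delta m_p)^4}{A} = \frac{{}_p\mu_4}{\bar n_p^3} + 3\left(1 - \frac{1}{N}\right)\frac{{}_p\mu_2^{\,2}}{\bar n_p^2},$$

$${}_pB_2 \times \left\{\frac{\Sigma(\delta m_p)^2}{A}\right\}^2 = \frac{{}_p\mu_2^{\,2}}{\bar n_p^2}\left\{\frac{{}_p\beta_2}{\bar n_p} + 3\left(1 - \frac{1}{N}\right)\right\},$$

where ${}_pB_2$ is the second $\beta$-coefficient of the distribution of $m_p$ for the $p$th array, and
${}_p\beta_2$ is the second $\beta$-coefficient for the $p$th array itself in the sampled population.


Now to the degree of approximation needful

$$\frac{\Sigma(\delta m_p)^2}{A} = \frac{{}_p\mu_2}{\bar n_p}\left\{1 - \frac{2}{\bar n_p}\left(1 - \frac{\bar n_p}{N}\right)\right\}.$$

Accordingly

$${}_pB_2 = \left\{\frac{{}_p\beta_2}{\bar n_p} + 3\left(1 - \frac{1}{N}\right)\right\}\Big/\left\{1 - \frac{2}{\bar n_p}\left(1 - \frac{\bar n_p}{N}\right)\right\}^2,$$

or

$${}_pB_2 - 3 = \frac{{}_p\beta_2 - 3}{\bar n_p} + 15\left(\frac{1}{\bar n_p} - \frac{1}{N}\right){}^{*}.$$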

Thus we see that the condition for normality of the array mean, i.e. ${}_pB_2 = 3$, is
by no means so nearly satisfied as is the condition for symmetry of the array mean,
${}_pB_1 = 0$. For, while the latter might be fairly closely approximated to by an array
of 15, even if the array distribution were not normal, yet the former even in the
case of normality and of a big sample would give ${}_pB_2 = 4$, a wide deviation from
the normal. Thus the array must be fairly considerable for the distribution of
its means to become reasonably Gaussian. As a rule the distribution will be
reasonably symmetrical but is leptokurtic, i.e. ${}_pB_2 > 3$, and therefore is to be
described by a curve of my Type VII or one of the form

$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-s},$$

rather than by a curve of the normal type†. Practically multiple frequency of
this form is at present undiscussed and we are thrown back on treating the arrays
as giving normal distributions of their means. Thus the assumption made by
Slutsky in the paper cited p. 248 can only be considered as approximative, and
that assumption was not legitimate until its degree of approximation had been
investigated. It is singular that the goodness of fit theory can actually be applied
with greater accuracy to test physical laws than to test regression lines.
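The two conditions compared above can be checked with a short calculation. This is only a sketch with hypothetical function names. It assumes the skewness result is read as ${}_pB_1 = {}_p\beta_1/\bar n_p$ and the kurtosis result as ${}_pB_2 - 3 = ({}_p\beta_2 - 3)/\bar n_p + 15(1/\bar n_p - 1/N)$, which is one reading of the displayed formulae and reduces to roughly $15/\bar n_p$ for a normal array in a big sample.

```python
import math


def skewness_of_array_mean(beta1: float, n_bar: float) -> float:
    """First beta-coefficient B1 of the array mean, assumed equal to beta1 / n_bar."""
    return beta1 / n_bar


def kurtosis_excess_of_array_mean(beta2: float, n_bar: float,
                                  total: float = math.inf) -> float:
    """B2 - 3 for the array mean under the assumed reading of the formula.

    total is the sample size N; the default (infinity) is the "big sample" case.
    """
    inverse_total = 0.0 if math.isinf(total) else 1.0 / total
    return (beta2 - 3.0) / n_bar + 15.0 * (1.0 / n_bar - inverse_total)


# A markedly skew array (beta1 = 0.2) of only 10 gives a nearly symmetrical mean.
print(skewness_of_array_mean(0.2, 10))
# A normal array (beta2 = 3) of 15 in a big sample gives B2 = 4 for its mean.
print(3.0 + kurtosis_excess_of_array_mean(3.0, 15))
```

When the array is the whole sample ($\bar n_p = N$) the second function returns $(\beta_2 - 3)/N$, the familiar value for means of samples of constant size.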


It is clearly only in the large arrays that the kurtosis $({}_pB_2 - 3)$ approaches the
normal value zero. What must be understood by "large" can easily be estimated
roughly. For

$${}_pB_2 - 3 = \frac{{}_p\beta_2 + 12}{\bar n_p}\ \text{nearly} = 15/\bar n_p,\ \text{roughly}.$$

Hence if $\bar n_p = 75$, ${}_pB_2$ would equal 3.2, which is certainly a limit to what may
be roughly treated as a normal distribution. Accordingly when we assume a
normal distribution for the means of an array, we must remember that this is
really very rough in the case of the smaller arrays, and that we only do it in default
of a better theory. At the same time it must be noted that the small arrays will
have less weight than the larger, and the error made in assuming their distribution
normal will be of far less significance for the same deviation‡. Thus far we have
shown that (i) the means of different arrays are uncorrelated, (ii) that the standard
deviations of these means are given by $\sigma_{\bar n_p}/\sqrt{\bar n''_p}$, where $\bar n''_p = \bar n_p\left(1 - \dfrac{2}{N}\right) + 2$,


* This agrees with the result given by Isserlis (loc. cit. p. 31) only when $\bar n_p = N$, i.e. the "array"
is as in his case a marginal total.

† For the curve to be of normal type $s$ must be large. If the array were normal in the sampled
population and the sample large, then $s = (2\bar n_p + 25)/10$, and this would be only 8.5, if $\bar n_p$ were 30.
Thus the application of Gaussian theory to samples with even minimum arrays of 30 can only be
approximative.

‡ The deviations in the means of the small arrays are likely, however, to be much more irregular
and greater than in the case of the large arrays.
and (iii) that the distribution of the means of arrays is for the larger arrays approximately
normal, but that for the smaller arrays the distribution is approximately
symmetrical and markedly leptokurtic. The small arrays will, however, have
small relative weight and this is the only and not very satisfactory reason for
adopting a Gaussian system throughout. Slutsky* assumes the normality without,
I think, adequate investigation, and for that reason I have been unable to consider
his paper as adequate and final in this matter.


With the assumption stated above we can use for the multiple regression
surface

$$z = z_0\,e^{-\frac12 S\left\{\bar n''_p (m_p - \bar m_p)^2/\sigma^2_{\bar n_p}\right\}},$$

where $\bar n''_p = \bar n_p\left(1 - \dfrac{2}{N}\right) + 2$.


Here $m_p$ is the observed value of the mean of the $p$th array, and $\bar m_p$, $\sigma_{\bar n_p}$ and $\bar n_p$
are constants which if it were feasible ought to be obtained from the sampled
population. $S$ denotes a summation for all values of $p$ from the first to the last
array. Now if we suppose $\bar m_p$, $\sigma_{\bar n_p}$ and $\bar n_p$ known, we can calculate

$$\chi^2 = S\left[\frac{\bar n''_p (m_p - \bar m_p)^2}{\sigma^2_{\bar n_p}}\right]$$

for, say, the series of $t$ arrays, and as there will be $t$ independent variables all we
have to do is to determine from this value of $\chi^2$ the value of P, the probability,
corresponding to it in the tables for "goodness of fit"† under the value $n' = t + 1$.
This applies to any form of frequency surface giving any type of regression line,
i.e. locus of $\bar m_p$.
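The calculation just described may be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the printed tables: the function names are hypothetical, the table argument $n' = t + 1$ is taken to correspond to $t$ degrees of freedom, and P is computed from the standard finite series for the upper tail of $\chi^2$ with an integral number of degrees of freedom.

```python
import math


def chi_square_upper_tail(chi2: float, dof: int) -> float:
    """P that chi-square with `dof` degrees of freedom exceeds `chi2` (integer dof >= 1)."""
    if dof < 1:
        raise ValueError("dof must be a positive integer")
    if chi2 <= 0.0:
        return 1.0
    half = chi2 / 2.0
    if dof % 2 == 0:
        # exp(-x/2) * sum_{j=0}^{dof/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, dof // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) + sqrt(2/pi) exp(-x/2) [chi/1 + chi^3/(1.3) + ...]
    total, term, k = 0.0, math.sqrt(chi2), 1
    while k <= dof - 2:
        total += term
        k += 2
        term *= chi2 / k
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)


def regression_goodness_of_fit(observed_means, theoretical_means,
                               effective_sizes, array_variances):
    """Return (chi2, P) for t arrays, entering the tail with t degrees of freedom."""
    chi2 = sum(n * (m - mb) ** 2 / v
               for m, mb, n, v in zip(observed_means, theoretical_means,
                                      effective_sizes, array_variances))
    return chi2, chi_square_upper_tail(chi2, len(observed_means))


# The two values of chi-square reached in the illustrations of this paper.
print(round(chi_square_upper_tail(15.11, 11), 3))
print(round(chi_square_upper_tail(9.5845, 20), 3))
```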


Thus far the problem looks straightforward, but now arises the difficult question
as to what values are to be given to $\bar n_p$, $\bar m_p$ and $\sigma_{\bar n_p}$, which represent the unknown
sampled population. Usually in problems, as of probable error, where we have
the unknown constants of the sampled population we replace them by the corresponding
constants of the sample. But it appears a somewhat arbitrary course
to do this in the present instance (as suggested by Slutsky‡) for $\bar n_p$ and $\sigma_{\bar n_p}$ but
not for $\bar m_p$. I do not think it accordingly legitimate to substitute for $\bar n_p$ and
$\sigma_{\bar n_p}$ the sample values and leave $\bar m_p$ to be determined from other considerations.
Clearly since our object is to test the goodness of fit of the regression line we have
to replace $\bar m_p$ by $f(y_p)$, where

$$\bar m_p = f(y_p)$$

gives the regression curve or mean value of the array of $x$'s corresponding to the
value $y_p$ of the other variate $y$. Of course this regression curve is determined
from the whole series of observations and not from an individual array. But


* "On the Criterion of Goodness of Fit of the Regression Lines, etc." Journal of the R. Statistical
Society, Vol. LXXVII. p. 79.

† Tables for Statisticians, p. 26.

‡ loc. cit. pp. 78-84.
$\bar m_p$ is nevertheless subject to the probable error of the random sampling. I have
shown that the probable error of $\bar m_p$, if the regression be linear, is

$$\frac{.67449\,\sigma_x\sqrt{1 - r^2}}{\sqrt N}\left\{1 + \frac{(y_p - \bar y)^2}{\sigma_y^2}\right\}^{\frac12}{}^{*}.$$

But this of course is not the probable error of $m_p$, but of $\bar m_p$ as found from the
regression line, and will generally be small as compared with the probable error
of $m_p$ found from a single array†.

Just as we have found $\bar m_p$, however, from the whole system of observations,
so it appears to me we ought to determine $\sigma^2_{\bar n_p}$ and $\bar n_p$ from the whole system and
not from a single array. The centre of $\sigma_{\bar n_p}$ is $\bar m_p$ and not $m_p$ and to calculate a
value for $\sigma_{\bar n_p}$ with $m_p$ as centre seems to be an erroneous step especially when,
owing possibly to $n_p$ differing considerably from $\bar n_p$, $m_p$ is much displaced from
$\bar m_p$. I hold that for satisfactory results it is just as needful to find $\bar n_p$ and $\sigma_{\bar n_p}$
from the whole range of data, not from the individual array, as it is to find $\bar m_p$.
This means that we must have some knowledge of the form of the frequency
surface, and until we have this we cannot apply the test for goodness of fit to
the regression line. There are accordingly two separate factors remaining for
solution after we have determined the regression line:


(i) We need to determine $\bar n_p$. This is the total frequency of the $p$th array
of $x$'s. It can clearly be determined, when we know the frequency of the $y$'s
in this group. That is to say, we must determine a theoretical distribution for
the marginal distribution of the $y$-variate. In some cases it will be sufficient
to assume it Gaussian and then the table of the probability integral will suffice.
In other cases it will be advisable to determine a skew frequency curve. But
as a rule it will certainly be needful to graduate the array frequencies by some
process, and not to assume them given by the observed marginal frequencies‡.
The bringing of the determination of $\bar n_p$ into line with that of $\bar m_p$ does not seem
therefore to present great difficulties.

(ii) We need to determine $\sigma_{\bar n_p}$. If the frequency surface be homoscedastic,
then $\sigma^2_{\bar n_p} = \sigma_x^2(1 - \eta^2_{x.y})$, if $\eta_{x.y}$ be the correlation ratio of $x$ on $y$, and the
regression be skew. But if the regression be both homoscedastic and linear, then
$\sigma^2_{\bar n_p} = \sigma_x^2(1 - r^2_{xy})$, where $r_{xy}$ is the correlation coefficient. In these two cases
we may write respectively

$$\chi^2 = \frac{1}{\sigma_x^2\left(1 - \eta^2_{x.y}\right)}\,S\{\bar n''_p (m_p - \bar m_p)^2\},$$
* Biometrika, Vol. IX. p. 10, with the necessary changes in notation to fit the notation of this paper.

† Extreme arrays here again form an exception.

‡ A precisely similar difficulty arises in working the ordinary expression for mean square contingency,
i.e. $1 + \phi^2 = S\left(\dfrac{n_{pq}^2}{\bar n_{p.}\,\bar n_{.q}}\right)$, where $\bar n_{p.}$ and $\bar n_{.q}$ are the marginal frequencies (reduced of course in
the proportion of size of sample to population) in the sampled population, not in the sample, although
we ultimately use their sample values. There is more justification, however, in this use, for contingency
is usually applied to broad categories and in such cases we have, perhaps, 3 to 7 marginal groups only;
there is thus relatively less fear of big irregularities in $\bar n_{p.}$ or $\bar n_{.q}$ such as arise with the small arrays of
regression lines.
and

$$\chi^2 = \frac{1}{\sigma_x^2\left(1 - r^2_{xy}\right)}\,S\{\bar n''_p (m_p - \bar m_p)^2\},$$

which admit of easy calculation when once we have determined that characters
of the sampled population $\bar m_p$, $\bar n_p$, $\sigma_x$, $r_{xy}$ or $\eta_{x.y}$ may be replaced by the like
characters calculated not from a single array but from the whole material of the
sample. I am inclined to believe that

$$\chi^2 = N\,\frac{(\eta_{x.y})^2 - (\bar\eta_{x.y})^2}{1 - \bar\eta^2_{x.y}}$$

would be an effective measure of $\chi^2$, where, however, we must not put $\eta_{x.y} = \bar\eta_{x.y}$
in the numerator for we are actually measuring the improbability of the deviations
of $\eta_{x.y}$ from $\bar\eta_{x.y}$, i.e. of $m_p$ from $\bar m_p$. But such a form would be of little use unless
we knew the sampled population. It indicates, however, how risky is the problem
of replacing only certain of the sampled population values of the constants by
those of the sample.


In any given case it appears best to draw the scedastic curve, i.e. plot $\sigma_{n_p}$ to
$y_p$. This curve may be fitted with either a straight line or parabola of the second
order, according as the array variability decreases or increases in one or both
directions from some fairly central array. This is perfectly easy, but care must
be taken to weight the standard deviations of the arrays with their frequencies.
Very rarely as I have pointed out previously are the data ample enough to justify
anything but a linear scedastic curve. If this line be practically horizontal then
we can take $\sigma^2_{\bar n_p} = \sigma_x^2(1 - \eta^2_{x.y})$ throughout. I propose to illustrate the whole
process on the two examples selected by Slutsky.


Illustration I. Prices of Rye in Samara. 
The example is given in Slutsky’s recent paper* and is as follows: 
Correlation between prices of rye at monthly intervals. 


Price of rye in Samara a month earlier 














|Copecks} o5 | 39 | 35 | 40 | 45 | 50 | 55 | 60 | 65 | 70 | 75 | Totals 
| per pud | | 
| | 
| 26 3 5 BT eth ee fl ewe 4d te Gd ae ae be fe 9 
| 30 6 | 13 Shawl wf me | co ae oe eee 2i 
35 me 3 wl eee See ee ee ee ee 8 
40 oe 1 | 2 eee Pes ees Pe se 4 
45 aay ae 1 | 2 | 10 “Bes a ae ee 15 
50 —};—-—|-|- 2 19 4|1 ae oe a 26 
55 we en 3 2 | 25] 15) — 0 
60 —|—| — | — | Rete ee 12: 
65 —|—- —|- — l l 35 | 5 1 | — 11+ 
70 ey OS ee ee ee ee ee 4/1 6 
75 ae eee ee ee ee ee ee 1/1 2 
| = | 
Totals | 9 | 21 | 5 4 | 15 | 2% | 9 |125|125| 6 | 2 | 124 














* "On the Criterion of Goodness of Fit of the Regression Lines and on the Best Method of Fitting
them to the Data." Journal of the R. Statistical Society, Vol. LXXVII. p. 81.
The data are not very suitable, but I use them because Slutsky has done so.
Their unsuitability arises (i) from the smallness of the total population, but
especially (ii) from the marked signs of heterogeneity obvious in the marginal
totals. If $y$ be the vertical, $x$ the horizontal variate I obtain, assuming that the
column headings represent central values, the following constants for the distribution:

$m_x$ = 47.1572 copecks per pud, $\sigma_x$ = 13.7769,

$m_y$ = 47.0363 copecks per pud, $\sigma_y$ = 13.6852,

$r_{xy}$ = .953,545,

Regression line: $y_x$ = .947,205 $x$ + 2.3687.

My values of $\sigma_x$, $\sigma_y$, $r_{xy}$, and of the regression line constants, differ very considerably
from Slutsky's. I have not used Sheppard's correction which would, however,
only have emphasised our differences. I have not reworked Slutsky's first method
with these changes because I do not think it is the correct method, i.e. he uses
observational values for $\sigma_{\bar n_p}$, and thus I cannot say whether the above values
would improve his bad fit (i.e. P = .02), but I have adopted my own value

$$\sigma^2_{\bar n_p}\Big/\left\{\bar n_p\left(1 - \frac{2}{N}\right) + 2\right\},$$

using for $\sigma^2_{\bar n_p}$ the mean value $\sigma_y^2(1 - r^2_{xy})$ = 16.9965. Slutsky takes for the
value of $\sigma_{\bar n_p}$, on the assumption of homoscedastic arrays, the mean of the observed
standard deviations of the arrays, i.e.

$$\frac{1}{N}S(n_p\sigma_{n_p}) = 3.8022$$

according to his values, or he takes $\sigma^2_{\bar n_p}$ = 14.4567.

This value is, I think, theoretically incorrect, the mean value of $\sigma^2_{n_p} = \sigma_y^2(1 - r^2_{xy})$
and this must be the homoscedastic value. Clearly Slutsky's value is too small.
The point remaining is the value of $\bar n_p$. I should naturally determine it from
the frequency curve for the marginal $x$-totals, but the extreme irregularity of the
marginal $x$-totals, due partly to paucity of data, but more to probable heterogeneity,
makes any such process unsatisfactory. I have therefore taken $\bar n_p$ equal to
the observed array frequency, a result with which I am thoroughly dissatisfied,
but which appears to be the only course. We have then the table on p. 253.

| x array | n''_p = n_p(1 - 2/N) + 2 | Observed m_p | Theoretical m_p | (m_p - m_p)^2 | n''_p (m_p - m_p)^2 |
|---------|--------------------------|--------------|-----------------|---------------|---------------------|
| 25      | 10.8548                  | 28.333       | 26.049          | 5.2167        | 56.6262             |
| 30      | 22.6613                  | 29.523       | 30.785          | 1.5926        | 36.0904             |
| 35      | 9.8710                   | 34.375       | 35.521          | 1.3133        | 12.9636             |
| 40      | 5.9355                   | 42.500       | 40.257          | 5.0310        | 29.8615             |
| 45      | 16.7581                  | 44.000       | 44.993          | .9860         | 16.5235             |
| 50      | 26.5967                  | 50.800       | 49.729          | 1.1470        | 30.5064             |
| 55      | 10.8548                  | 55.500       | 54.465          | 1.0712        | 11.6277             |
| 60      | 14.2984                  | 59.600       | 59.201          | .1592         | 2.2763              |
| 65      | 14.2984                  | 62.200       | 63.937          | 3.0172        | 43.1411             |
| 70      | 7.9032                   | 70.000       | 68.673          | 1.7609        | 13.9167             |
| 75      | 3.9677                   | 72.500       | 73.409          | .8263         | 3.2785              |

$S\{\bar n''_p (m_p - \bar m_p)^2\}$ = 256.81196.

Thus $\chi^2$ = 256.81196 / 16.9965 = 15.11.

Looking this out in the Tables for Goodness of Fit we find for $n'$ = 11 + 1 = 12,
P = .18.

Thus we see that the fit is passable, although not brilliant, much better than
Slutsky's P = .02.

Slutsky also gives by his second method $\chi^2$ = 15.1 and P = .18 for the fit on
the hypothesis of homoscedasticity, but I think this can only arise from a curious
balancing of errors. For (i) his theoretical $\bar m_p$'s differ considerably from mine
owing to the divergence in our values of standard deviations, correlation and
regression, and (ii) he has used a different value to mine for $\bar n_p$, and what I believe
to be an erroneous value of $\sigma^2_{\bar n_p}$, namely too low a value. It appears to be thus
a mere chance that we should reach the same result.
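The arithmetic of Illustration I can be retraced in a few lines. The sketch below is not from the memoir; the array frequencies are the marginal totals as read from the correlation table (they are assumed to be 9, 21, 8, 4, 15, 25, 9, 12.5, 12.5, 6, 2, summing to 124), and the theoretical means are recomputed from the regression line rather than copied from the printed column, so the last digits may differ slightly.

```python
# Central values of the x arrays, their frequencies, and the observed array means.
x_values = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
array_sizes = [9, 21, 8, 4, 15, 25, 9, 12.5, 12.5, 6, 2]
observed_means = [28.333, 29.523, 34.375, 42.500, 44.000, 50.800,
                  55.500, 59.600, 62.200, 70.000, 72.500]

total = sum(array_sizes)             # size of the sample, N
slope, intercept = 0.947205, 2.3687  # regression line quoted in the text
array_variance = 16.9965             # sigma_y^2 (1 - r^2), the homoscedastic value


def effective_size(n_bar, n_total):
    """Corrected array size n_bar * (1 - 2/N) + 2."""
    return n_bar * (1.0 - 2.0 / n_total) + 2.0


weighted_sum = 0.0
for x, n, m in zip(x_values, array_sizes, observed_means):
    theoretical = slope * x + intercept
    weighted_sum += effective_size(n, total) * (m - theoretical) ** 2

chi2 = weighted_sum / array_variance
print(total, round(weighted_sum, 2), round(chi2, 2))
```

The resulting value of $\chi^2$ should agree closely with the 15.11 found above.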


Illustration II. Auricular Height of School Girls. 


Slutsky takes this example from my memoir on Skew Correlation*, and it is 
a peculiarly good illustration for the following reasons: 


(i) The regression line is distinctly skew. In my paper $Y_p$, the mean deviation
in auricular height from the mean auricular height of the general population of
girls who differ from the mean age of the general population by $X_p$, is given
by the cubic

$$Y_p = .296{,}076 + .722{,}886\,X_p - .029{,}580\,X_p^2 - .002{,}223\,X_p^3,$$

and the goodness of fit of this regression line is to be tested.









































| Age       | 3–4 | 4–5 | 5–6 | 6–7 | 7–8 | 8–9 | 9–10 | 10–11 | 11–12 | 12–13 | 13–14 |
| Frequency | 1   | 7   | 18  | 40  | 76  | 125 | 177  | 235   | 261   | 309   | 263   |

| Age       | 14–15 | 15–16 | 16–17 | 17–18 | 18–19 | 19–20 | 20–21 | 21–22 | 22–23 |
| Frequency | 198   | 214   | 162   | 95    | 61    | 13    | 7     | 8     | 2     |

This is not normal. We find for its constants:

Mean age = 12.7007 years, $\beta_1$ = .001,335,
$\sigma_x$ = 3.064,819 years, $\beta_2$ = 2.710,593.

The probable error of $\beta_2$ is about .045 and thus $\beta_2$ cannot be the result of sampling
from Gaussian material. $\beta_1$ is sufficiently near zero to mark the distribution as
substantially symmetrical.

* Drapers' Company Research Memoirs. Biometric Series II. p. 34. Cambridge University Press.
The distribution can accordingly be described by a symmetrical limited range
curve of my Type II:

$$y = y_0\left(1 - \frac{x^2}{a^2}\right)^m.$$

The values of the constants being found we have

$$y = 283.7415\left\{1 - \frac{x^2}{(\ldots)^2}\right\}^{7.866{,}024}.$$

This curve was drawn on a large scale and the frequencies for each age
integrated with the following results:

| Age   | Observed | Calculated | Age   | Observed | Calculated |
|-------|----------|------------|-------|----------|------------|
| 3–4   | 1        | 1.5        | 13–14 | 263      | 275        |
| 4–5   | 7        | 7.5        | 14–15 | 198      | 246        |
| 5–6   | 18       | 19         | 15–16 | 214      | 197        |
| 6–7   | 40       | 41.5       | 16–17 | 162      | 143.5      |
| 7–8   | 76       | 76         | 17–18 | 95       | 91.5       |
| 8–9   | 125      | 123.5      | 18–19 | 61       | 53         |
| 9–10  | 177      | 178        | 19–20 | 13       | 26         |
| 10–11 | 235      | 227.5      | 20–21 | 7        | 9          |
| 11–12 | 261      | 265.5      | 21–22 | 8        | 7          |
| 12–13 | 309      | 282.5      | 22–23 | 2        | 1.5        |

The goodness of fit of the calculated to the observed array frequencies was tested.
I found $\chi^2$ = 25.4769 giving P = .146, or such a sample would occur about once
in seven trials. The fit therefore is a fairly reasonable one, and the above values,
not the observed ones, have been used for the array frequencies. Clearly they
effectively smooth the random sampling.
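The test of the graduated against the observed age frequencies is the ordinary sum of (observed - calculated)^2 / calculated. A minimal check, with the frequencies as read from the table above (the entry 95 for ages 17–18 is an assumed reading of a damaged figure) and hypothetical variable names:

```python
observed = [1, 7, 18, 40, 76, 125, 177, 235, 261, 309,
            263, 198, 214, 162, 95, 61, 13, 7, 8, 2]
calculated = [1.5, 7.5, 19, 41.5, 76, 123.5, 178, 227.5, 265.5, 282.5,
              275, 246, 197, 143.5, 91.5, 53, 26, 9, 7, 1.5]

# Both series must account for the same total frequency.
print(sum(observed), sum(calculated))

chi2_age = sum((o - c) ** 2 / c for o, c in zip(observed, calculated))
print(round(chi2_age, 4))
```

The largest single contributions come from the 14–15 and 19–20 groups.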

(iii) The standard deviations of the arrays have been given by me*, and it
has been shown that the arrays are very far from homoscedastic. The value
of $\eta$ is .303,024, which combined with $\sigma_y$ gives for the mean square standard
deviation of the arrays in squared 2 mms. units

$$\bar\sigma^2_{n_x} = 10.835{,}433.$$

I now somewhat diverged from the plan of my memoir on skew regression.
I sought the best fitting straight line to the weighted squares of the standard
deviations, i.e. I made

$$u = S\{n_x(\sigma^2_{n_x} - Ax - B)^2\}$$

a minimum, where $n_x$ is the frequency of the $p$th array of auricular heights for
girls of age $x$, and $\sigma_{n_x}$ is the standard deviation of this array. In working this
I omitted the first and last arrays as quite unreliable. I found A = .436,706,
giving for the line

$$\sigma^2_{n_x} - \bar\sigma^2_{n_x} = .436{,}706\,(x - \bar x),$$

or

$$\sigma^2_{n_x} = .436{,}706\,x + 5.290{,}970.$$
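Making the weighted sum of squares a minimum leads to the usual weighted least-squares line. The following is a generic sketch with a hypothetical function name and made-up check data; it does not reproduce the memoir's own working, since the exact weights used there are not restated here.

```python
def weighted_line_fit(x, y, weights):
    """Return (A, B) minimising sum w * (y - A*x - B)^2."""
    sum_w = sum(weights)
    x_mean = sum(w * xi for w, xi in zip(weights, x)) / sum_w
    y_mean = sum(w * yi for w, yi in zip(weights, y)) / sum_w
    s_xx = sum(w * (xi - x_mean) ** 2 for w, xi in zip(weights, x))
    s_xy = sum(w * (xi - x_mean) * (yi - y_mean)
               for w, xi, yi in zip(weights, x, y))
    slope = s_xy / s_xx
    # The fitted line passes through the weighted means, as in the text.
    return slope, y_mean - slope * x_mean


# Made-up data lying exactly on y = 2x + 1: any positive weights recover the line.
A, B = weighted_line_fit([1, 2, 3, 4], [3, 5, 7, 9], [10, 1, 5, 2])
print(A, B)

# Evaluating the line quoted in the text at the centre of the 3-4 age group.
print(round(0.436706 * 3.5 + 5.290970, 4))
```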


* Loc. cit. p. 34, Table and Plate I, Diagram I. 
From this equation the squared standard deviations of the arrays were calculated
with the following results, the unit being 2 mms.:

Squared Standard Deviation

| Age array | Observed | Calculated | Age array | Observed | Calculated |
|-----------|----------|------------|-----------|----------|------------|
| 3–4       | [0]      | 6.8194     | 13–14     | 11.2822  | 11.1865    |
| 4–5       | 8.3250   | 7.2561     | 14–15     | 12.8630  | 11.6232    |
| 5–6       | 8.5708   | 7.6929     | 15–16     | 12.0152  | 12.0599    |
| 6–7       | 8.7859   | 8.1296     | 16–17     | 14.9738  | 12.4966    |
| 7–8       | 8.9293   | 8.5663     | 17–18     | 10.0356  | 12.9333    |
| 8–9       | 6.9517   | 9.0030     | 18–19     | 9.7563   | 13.3700    |
| 9–10      | 11.4765  | 9.4397     | 19–20     | 23.4314  | 13.8067    |
| 10–11     | 8.7930   | 9.8764     | 20–21     | 6.4065   | 14.2434    |
| 11–12     | 10.2970  | 10.3131    | 21–22     | 17.1512  | 14.6801    |
| 12–13     | 10.2791  | 10.7498    | 22–23     | [.91662] | 15.1169    |





The calculated results give a reasonable graduation of the somewhat erratic observed
values, and these values were used in evaluating the formula

$$\chi^2 = S\left\{\frac{\bar n''_p (m_p - \bar m_p)^2}{\sigma^2_{\bar n_p}}\right\}.$$

The following table now gives all the required data for the calculation of $\chi^2$,
and we deduce $\chi^2$ = 9.5845 corresponding to P = .974 for $n'$ = 20 + 1 = 21 as
argument.

Regression Curve for Girls' Auricular Height with Age.

| Age group | n_p (graduated) | n''_p = n_p(1 - 2/N) + 2 | Observed m_p* | Theoretical m_p* | (m_p - m_p)^2 * | sigma^2 of array† | n''_p (m_p - m_p)^2 / sigma^2 |
|-----------|-----------------|--------------------------|---------------|------------------|-----------------|-------------------|-------------------------------|
| 3–4       | 1.5             | 3.4987                   | 115.25        | 116.90           | 2.7225          | 6.8194            | 1.3968                        |
| 4–5       | 7.5             | 9.4934                   | 116.96        | 117.66           | .4900           | 7.2561            | .6411                         |
| 5–6       | 19              | 20.9833                  | 117.47        | 118.42           | .9025           | 7.6929            | 2.4617                        |
| 6–7       | 41.5            | 43.4635                  | 119.10        | 119.24           | .0196           | 8.1296            | .1048                         |
| 7–8       | 76              | 77.9331                  | 120.30        | 120.08           | .0484           | 8.5663            | .4403                         |
| 8–9       | 123.5           | 125.3913                 | 121.63        | 120.93           | .4900           | 9.0030            | 6.8246                        |
| 9–10      | 178             | 179.8433                 | 121.72        | 121.78           | .0036           | 9.4397            | .0686                         |
| 10–11     | 227.5           | 229.2997                 | 122.82        | 122.62           | .0400           | 9.8764            | .9287                         |
| 11–12     | 265.5           | 267.2663                 | 123.14        | 123.42           | .0784           | 10.3131           | 2.0318                        |
| 12–13     | 282.5           | 284.2513                 | 123.89        | 124.18           | .0841           | 10.7498           | 2.2238                        |
| 13–14     | 275             | 276.7579                 | 124.86        | 124.88           | .0004           | 11.1865           | .0099                         |
| 14–15     | 246             | 247.7834                 | 125.71        | 125.52           | .0361           | 11.6232           | .7696                         |
| 15–16     | 197             | 198.8266                 | 126.16        | 126.07           | .0081           | 12.0599           | .1335                         |
| 16–17     | 143.5           | 145.3737                 | 126.53        | 126.52           | .0001           | 12.4966           | .0012                         |
| 17–18     | 91.5            | 93.4195                  | 126.91        | 126.87           | .0016           | 12.9333           | .0116                         |
| 18–19     | 53              | 54.9533                  | 127.02        | 127.09           | .0049           | 13.3700           | .0201                         |
| 19–20     | 26              | 27.9771                  | 129.56        | 127.18           | 5.6644          | 13.8067           | 11.4780                       |
| 20–21     | 9               | 10.9921                  | 123.82        | 127.11           | 10.8241         | 14.2434           | 8.3533                        |
| 21–22     | 7               | 8.9938                   | 126.50        | 126.88           | .1444           | 14.6801           | .0885                         |
| 22–23     | 1.5             | 3.4987                   | 125.25        | 126.48           | 1.5129          | 15.1169           | .3502                         |

Total $\chi'^2$ = 38.3381

True $\chi^2 = \tfrac14\chi'^2$ = 9.5845

For values of $m_p$ and $\bar m_p$ (cubic (c)) see loc. cit. p. 37.

* In millimetre units.
† In 2 millimetre units.

Hence $\chi'^2$ as given above must be divided by four, to obtain true value.
We are now in a position to appreciate the influence of the various factors in
the solution for this case. I hold that as $\bar m_p$ must be given the theoretical or
calculated value, so also must the standard deviations of the arrays, i.e. $\sigma^2_{\bar n_p}$. But
as we do not know the sampled population in this case we must obtain $\sigma^2_{\bar n_p}$ in
precisely the same manner as we find $\bar m_p$, i.e. not by taking a value obtained from
the individual array*, but by values graduated from the whole sample. Further
it does not appear correct to take $\sigma^2_{m_p} = \sigma^2_{\bar n_p}/\bar n_p$. The latter is only true when the
number in the array is a priori fixed, that is to say is not itself provided by the
random sampling. In the latter case we must take $\sigma^2_{m_p} = \sigma^2_{\bar n_p}/\bar n''_p$, where

$$\bar n''_p = \bar n_p\left(1 - \frac{2}{N}\right) + 2,$$

to a second approximation and this modifies considerably the standard deviations
of the small frequency arrays. If in the above example we use $\bar n_p$, the theoretical
frequency of the array, instead of the value $\bar n''_p$ we find, still using the theoretical
$\sigma^2_{\bar n_p}$, $\chi^2$ = 8.6166 giving P = .986. Slutsky*, who has used both the observed
frequencies and the observed standard deviations† of the individual arrays,
finds $\chi^2$ = 9.17 and P = .980. It will be seen that the correction for $\bar n''_p$ is
the more important factor. In this case no marked changes are produced
by using observed quantities instead of graduated values, but I think that in
short series very fallacious results might be reached by this process, and the present
paper is written to suggest caution at this point. $\bar n_p$ and $\sigma^2_{\bar n_p}$ are as definitely
as $\bar m_p$ values in the sampled population, not in the sample itself.
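The two values of $\chi^2$ compared in this paragraph can be retraced from the tabulated quantities. The sketch below uses the graduated frequencies, the means in millimetres and the graduated squared standard deviations in 2 mm units as read from the tables of Illustration II; the total N is taken as the sum of the graduated frequencies, which is an assumption of this sketch, as are the variable names.

```python
graduated_sizes = [1.5, 7.5, 19, 41.5, 76, 123.5, 178, 227.5, 265.5, 282.5,
                   275, 246, 197, 143.5, 91.5, 53, 26, 9, 7, 1.5]
observed_mm = [115.25, 116.96, 117.47, 119.10, 120.30, 121.63, 121.72,
               122.82, 123.14, 123.89, 124.86, 125.71, 126.16, 126.53,
               126.91, 127.02, 129.56, 123.82, 126.50, 125.25]
theoretical_mm = [116.90, 117.66, 118.42, 119.24, 120.08, 120.93, 121.78,
                  122.62, 123.42, 124.18, 124.88, 125.52, 126.07, 126.52,
                  126.87, 127.09, 127.18, 127.11, 126.88, 126.48]
variances_2mm = [6.8194, 7.2561, 7.6929, 8.1296, 8.5663, 9.0030, 9.4397,
                 9.8764, 10.3131, 10.7498, 11.1865, 11.6232, 12.0599,
                 12.4966, 12.9333, 13.3700, 13.8067, 14.2434, 14.6801, 15.1169]

total = sum(graduated_sizes)  # N


def chi_square(sizes):
    """Sum of size * (difference in mm)^2 / variance, divided by 4 for the 2 mm unit."""
    raw = sum(n * (m - mb) ** 2 / v
              for n, m, mb, v in zip(sizes, observed_mm, theoretical_mm,
                                     variances_2mm))
    return raw / 4.0


corrected_sizes = [n * (1.0 - 2.0 / total) + 2.0 for n in graduated_sizes]
chi2_corrected = chi_square(corrected_sizes)      # using the corrected array sizes
chi2_uncorrected = chi_square(graduated_sizes)    # using the graduated frequencies as they stand
print(round(chi2_corrected, 4), round(chi2_uncorrected, 4))
```

Almost all of the difference between the two values arises in the two small arrays 19–20 and 20–21.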


Application of “Goodness of Fit” Theory to testing Physical, Technical or 
Astronomical Measurements. 


In these cases there is no question in the ordinary sense of a frequency surface.
The physicist makes a few measurements of a variate A for each of a series of values
of a variate B. He plots the mean of his measurements for A to each value of his
variate B, and he enquires whether the curve given by his series of mean values for
A is closely approximate to some theoretical curve. It will be seen that his problem
is very similar to that of the statistician. He has a number of means for the A
variate, $m_1, m_2, \ldots m_p, \ldots$, and he considers whether they are good fits to a theoretical
curve, the statistician's regression curve. Obviously in this case these means are
non-correlated as approximately in the statistical case. Further the variability of
a mean will be given definitely by $\sigma_{\bar n_p}/\sqrt{n_p}$, not now as an approximation. Here
$n_p$ is the number of observations in the array on which $m_p$ depends while $\sigma_{\bar n_p}$ is


* Loc. cit. p. 81.
† In the case of the 3-4 array of one with observed $\sigma_{n_p} = 0$ Slutsky says this is due to random
sampling and extrapolates a standard deviation; this is of course only a first slight step towards the
proper graduation of the whole system of array variations. In the case of the array of two for
23-24, the observed value is $\sigma_{n_p}$ = 1.9148, which is just as much an inconsistency due to random
sampling, but this value although about ¼ of the real value is retained and used by him.
the standard deviation (mean square error) of an indefinitely great number of
measurements which he might make of A for the given value of B. He attributes
these variations in A to “errors of observation,” and he usually supposes such
errors to obey the Gaussian law. This attribution is somewhat dogmatic. There
is very little definite proof that errors of observation actually do obey the Gaussian
law and secondly his “errors” are not in all probability solely due to observation.
It is impossible to repeat each experimental measurement under precisely the same
physical conditions for other variates C, D, E, ... and changes in these variates may
be as influential as personal errors of observation. Further it is far from certain that 
the value of B has remained without some variation and this alone would tend to 
cause some variation in A. The physicist aims at a constant value of B—it is by 
no means certain that he always reaches it. Without much more investigation 
than is easy, or is at all likely to be made, we probably can at present do no better 
than assume the distribution of A’s for constant B to be Gaussian. Thus the
distribution of means determined by the physicist will have for its distribution surface

$$z = z_0\, e^{-\frac{1}{2}\chi^2},$$

where

$$\chi^2 = S\left\{\frac{n_p\,(m_p - \tilde m_p)^2}{\sigma^2_{\bar n_p}}\right\},$$

and the corresponding value of the probability to be taken from the “goodness
of fit” tables will be found by taking out P for the given value of $\chi^2$ under the
argument $n' + 1$, where $n'$ is the number of arrays, for there are now $n'$ independent
variables, i.e. $m_p$’s.
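The computation of $\chi^2$ and of P just described can be sketched as follows. This is only an illustration under stated assumptions, not the tabular procedure of the text: the function names and the numerical inputs are hypothetical, the array variances are supposed known, and P is taken from the chi-square distribution with as many degrees of freedom as there are arrays, which is assumed here to correspond to the tabular argument named above.

```python
import math

def chi_square_of_means(n, m_obs, m_theory, sigma2):
    """chi^2 = S{ n_p (m_p - m~_p)^2 / sigma_p^2 }, summed over the arrays."""
    return sum(k * (mo - mt) ** 2 / s2
               for k, mo, mt, s2 in zip(n, m_obs, m_theory, sigma2))

def chi2_upper_tail(chi2, df):
    """P(X >= chi2) for a chi-square variate with df degrees of freedom (chi2 > 0).

    Uses the recurrence Q(a + 1, x) = Q(a, x) + x^a e^(-x) / Gamma(a + 1),
    starting from Q(1, x) = e^(-x) for even df or Q(1/2, x) = erfc(sqrt(x)) for odd df.
    """
    x = chi2 / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Hypothetical data: three arrays of four measurements, array variance 0.25 assumed known.
n = [4, 4, 4]
m_obs = [1.10, 2.05, 2.80]
m_theory = [1.00, 2.00, 3.00]
chi2 = chi_square_of_means(n, m_obs, m_theory, [0.25] * 3)
P = chi2_upper_tail(chi2, df=len(n))
```

The difficulty taken up in the following paragraphs is that the array variance passed in as `sigma2` is in practice not known and has to be estimated.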


In the above value for $\chi^2$, $\tilde m_p$ is the theoretical value of A corresponding to
the given B, $m_p$ is the observed value and $n_p$ is the number of observations on which
it depends. The real difficulty arises in determining $\sigma_{\bar n_p}$, which is the standard
deviation of the array for an indefinitely large number of observations and cannot
be determined properly from the few observations made to determine the A
corresponding to a given B.

If there are a considerable number of observations in an array and $\sigma_{n_p}$ be the
standard deviation of the array found from the observations themselves, then
it is well known that the “best” or most probable value of $\sigma_{\bar n_p}$ is given by

$$\sigma^2_{\bar n_p} = \frac{n_p}{n_p - 1}\,\sigma^2_{n_p}.$$

In this case

$$\chi^2 = S\left\{\frac{(n_p - 1)\,(m_p - \tilde m_p)^2}{\sigma^2_{n_p}}\right\},$$

a form which shows us that if we have arrays with only a single individual and
use the observed $\sigma_{n_p}$’s, $\chi^2$ will be indeterminate, for $n_p - 1 = 0$ and $\sigma^2_{n_p} = 0$. But
the observed $\sigma_{n_p}$ would be very risky for any system of small arrays, and this
method of approaching the difficulty must I think be dropped unless the physicist
be inclined to increase very much the number of measurements he makes of A for
a given value of B. But the above method of approaching the subject indicates
how a single value of A for each value of B cannot lead to any measure whatever
of the “goodness of fit.” I suggest the following method of determining a suitable
value for $\sigma_{\bar n_p}$. Let us assume that the arrays are homoscedastic, i.e. that the
physicist’s difficulty in measuring A is the same for all values of B. This will not
always be true, but is a fair working hypothesis for many cases. Then we have

$$\sigma^2_{\bar n_p} = \sigma_A^2\,(1 - \eta^2_{A\cdot B}),$$

where $\sigma_A$ is the standard deviation of all the measurements made on A (without
regard to the value of B), a much more reliable quantity than the standard
deviation of any array values of A. Further $\eta_{A\cdot B}$ is the correlation ratio of A
on B or

$$\eta^2_{A\cdot B} = \frac{S\{n_p\,(m_p - m)^2\}}{N\sigma_A^2},$$

where m is the mean value of all measurements of A and N is their total number.
Thus

$$\chi^2 = \frac{S\{n_p\,(m_p - \tilde m_p)^2\}}{\sigma_A^2\,(1 - \eta^2_{A\cdot B})}$$

can be readily determined as soon as m, $\sigma_A$ and $\eta_{A\cdot B}$ have been found.
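The homoscedastic estimate above may be illustrated by a short sketch. The function names and the small data set are hypothetical; the arrays are lists of repeated measurements of A, one list for each value of B, and the theoretical means are supplied by whatever curve is under test.

```python
def homoscedastic_array_variance(arrays):
    """Return (sigma_A^2, eta^2, sigma_A^2 (1 - eta^2)) from grouped measurements."""
    values = [a for arr in arrays for a in arr]
    N = len(values)
    m = sum(values) / N                                   # general mean of A
    sigma_A2 = sum((a - m) ** 2 for a in values) / N      # variance of all A
    between = sum(len(arr) * (sum(arr) / len(arr) - m) ** 2 for arr in arrays)
    eta2 = between / (N * sigma_A2)                       # squared correlation ratio
    return sigma_A2, eta2, sigma_A2 * (1.0 - eta2)

def chi_square_homoscedastic(arrays, m_theory):
    """chi^2 = S{ n_p (m_p - m~_p)^2 } / { sigma_A^2 (1 - eta^2) }."""
    _, _, array_var = homoscedastic_array_variance(arrays)
    numerator = sum(len(arr) * (sum(arr) / len(arr) - mt) ** 2
                    for arr, mt in zip(arrays, m_theory))
    return numerator / array_var

# Hypothetical measurements of A at three values of B, with hypothetical theoretical means.
arrays = [[1.0, 1.2], [2.1, 1.9, 2.0], [3.2, 2.8]]
m_theory = [1.0, 2.0, 3.1]
sigma_A2, eta2, array_var = homoscedastic_array_variance(arrays)
chi2 = chi_square_homoscedastic(arrays, m_theory)
```

If every array held a single measurement the correlation ratio would be unity and the estimated array variance zero, so that the quotient could not be formed.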


Of course it is needful for a test of this kind that the number of measurements 
of A should considerably exceed the number of values of B tested. It would fail 
entirely if only one value of A were taken for each value of B, however numerous 
the latter might be. We must have some basis on which to determine the error 
made in single determinations of A. This is a point I think often overlooked by 
the physicist. A fairly good determination—I mean a quantitative determination— 
of the goodness of fit of theory to observation could be made from 10 series of
8 observations of A corresponding to 10 values of B. But no measure of “ goodness 
of fit” could be found from 80 observations of A corresponding to 80 values of B, 
and yet the latter system would probably make the greater appeal to most physicists. 
I do not see how quantitatively to obtain any measure of the goodness of fit of 
theory to observation in the latter method of procedure. It is not unusual to 
determine the mean square residual, i.e.

$$S\,(m_p - \tilde m_p)^2/N,$$

but before we can really make use of this, i.e. find $\chi^2$ and so P, the probability of
as great or greater divergence between theory and observation, we must know
$\sigma_{\bar n_p}$ and this can in no way be deduced from such a system of observations. In
fact without a knowledge of $\sigma^2_{\bar n_p}$ (the unit in which $(m_p - \tilde m_p)^2$ is to be measured)
the mean square residual is as delusive as the ocular comparison of a graph of 
the theoretical and observed results, where all turns on the arbitrary scale of the 
vertical ordinate. 


Illustrations. As illustrations I will take some of the data connecting length
of arc with loss of carbon per coulomb provided in a recent memoir by Professor
Duffield*. My only reason for taking this material is that it is recent work and


* R. S. Proc. Vol. 92, A, p. 125.
that I was as a mere statistician struck by the appearance of the curves as I turned
over the pages of the Proceedings. I have considered the cases of 2 and 4 ampères
for the anode, Diagram I. The data for individual experiments are given on
Prof. Duffield’s diagrams, but his tables only give mean values. I have therefore 
had to measure off individual values from his diagrams. The means obtained 
from these measurements agree in the main with the author’s recorded means, 
but in a few cases it would appear that there is some discordance between table 
and diagram. On the diagrams curves are drawn presumably representing in 
each case the smoothed relation which the author considers to hold between loss 
of weight and length of arc. My object is to enquire how far these curves are 
probable representations of the phenomena in question. It will be seen that the 
observations are usually made in pairs, more rarely in triplicate or are even 
quadruple. There are a considerable number of isolated observations. We have 
thus rather slender material from which to determine the error of a single observa- 
tion, and clearly but little conclusion could be drawn did we not assume that a 
measurement at each arc length was of equal weight*. 

The following table exhibits the work, first for the 4-ampère curve; the
weight lost being per coulomb x 10° grammes. 



































Arc length | $n_p$ | Weight lost $w_p$ | $m_p$ | $\tilde m_p$ | $w_p - \bar w$ | $m_p - \bar w$ | $m_p - \tilde m_p$ | $n_p (m_p - \tilde m_p)^2$
1  | 1 | 17.62        | 17.620 | 17.52 | -3.32        | -3.32 | +.10 | .0100
2  | 2 | 17.92, 18.30 | 18.110 | 18.24 | -3.02, -2.64 | -2.83 | -.13 | .0338
3  | 2 | 19.19, 19.62 | 19.405 | 19.15 | -1.75, -1.32 | -1.54 | +.25 | .1250
4  | 2 | 19.70, 20.32 | 20.010 | 20.12 | -1.24, -.62  | -.93  | -.11 | .0242
5  | 1 | 21.43        | 21.430 | 21.29 | +.49         | +.49  | +.14 | .0196
7  | 1 | 21.35        | 21.350 | 22.26 | +.41         | +.41  | -.91 | .8281
8  | 2 | 22.40, 22.63 | 22.515 | 22.51 | +1.46, +1.69 | +1.57 | .00  | .0000
10 | 2 | 21.37, 22.89 | 22.130 | 22.71 | +.43, +1.95  | +1.19 | -.58 | .6728
12 | 2 | 22.55, 23.07 | 22.810 | 22.79 | +1.61, +2.13 | +1.87 | +.02 | .0008
15 | 1 | 22.69        | 22.690 | 22.79 | +1.75        | +1.75 | -.10 | .0100
20 | 1 | 22.89        | 22.890 | 22.79 | +1.95        | +1.95 | +.10 | .0100

Totals: $N = 17$; $S(w_p) = 355.94$; sum of squares $S(w_p - \bar w)^2 = 57.2166$; $S\{n_p (m_p - \bar w)^2\} = 55.5422$; $S\{n_p (m_p - \tilde m_p)^2\} = 1.7343$.

$\bar w = 20.94$; $\sigma_A^2 = 3.3657$; $\Sigma^2 = 3.2672$; $\chi^2 = 17.607$; $\eta^2 = \Sigma^2/\sigma_A^2 = .9708$.

$n' = n + 1 = 11 + 1 = 12$, $\therefore$ P = .091.

















* This is probably not true, but there is no evidence of multiple measurements at points where the 
measurements might be assumed to be critical. 
Thus the probability, if the inscribed curve represented the phenomenon, that
a system of observations as unfavourable or more unfavourable than the observed
would arise is about 1 in 11, i.e. in eleven trials we should have had one result as
bad. This is not very great odds (10 to 1) against the inscribed curve describing
the facts, but it cannot be called a highly satisfactory concordance.
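As a check on the arithmetic, the quantities entering $\chi^2$ for the 4-ampère curve may be written out. The figures are those of the table; the layout is only illustrative, and $\Sigma^2$ is taken to stand for $\eta^2\sigma_A^2$, the variance of the array means.

$$\sigma_A^2\,(1 - \eta^2) = \sigma_A^2 - \Sigma^2 = 3.3657 - 3.2672 = .0985, \qquad \chi^2 = \frac{S\{n_p\,(m_p - \tilde m_p)^2\}}{\sigma_A^2\,(1 - \eta^2)} = \frac{1.7343}{.0985} = 17.607.$$

With eleven arrays a chi-square of this size is exceeded with a probability of roughly .09, which is the 1 in 11 stated above.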


I now take the anode 2-ampère data. The work is given in the following table:

Arc length | $n_p$ | Weight lost $w_p$ | $m_p$ | $\tilde m_p$ | $w_p - \bar w$ | $m_p - \bar w$ | $m_p - \tilde m_p$ | $n_p (m_p - \tilde m_p)^2$
1 | 2 | 25.31, 26.04               | 25.68 | 25.66 | -4.80, -4.07               | -4.43 | +.02  | .0008
2 | 3 | 26.04, 26.77, 26.95        | 26.59 | 26.46 | -4.07, -3.34, -3.16        | -3.52 | +.13  | .0507
3 | 2 | 26.83, 28.30               | 27.57 | 27.68 | -3.28, -1.81               | -2.54 | -.11  | .0242
4 | 2 | 28.93, 29.15               | 29.04 | 30.73 | -1.18, -.96                | -1.07 | -1.69 | 5.7122
6 | 4 | 34.00, 34.28, 34.89, 35.11 | 34.57 | 34.46 | +3.89, +4.17, +4.78, +5.00 | +4.46 | +.11  | .0484
[illegible] | 2 | 33.98, 35.00     | 34.49 | 34.61 | +3.87, +4.89               | +4.38 | -.12  | .0288

Totals: $N = 15$; $S(w_p) = 451.58$; $S\{n_p (m_p - \tilde m_p)^2\} = 5.8651$.

$\bar w = 30.11$; $\sigma_A^2 = 14.1947$; $\Sigma^2 = 13.9673$; $\sigma_A^2 (1 - \eta^2) = .2274$; $\chi^2 = 25.72$; $\eta^2 = \Sigma^2/\sigma_A^2 = .9839$.

$n' = n + 1 = 6 + 1 = 7$, $\therefore$ P = .0003.





Thus only three times in 10,000 trials if the inscribed curve actually represented
the phenomenon would a series of observations so widely divergent as those
observed arise. In other words the inscribed curve must be definitely rejected 
as a probable description of the series of phenomena. Now the great advantage 
of this “goodness of fit” method is that by the very working of it out the actual 
regions at which the theoretical results diverge with a maximum of improbability 
from the observations are indicated, and the investigator is able to say here are the 
points where discordance is greatest and where theory or observation needs modi- 
fication. Clearly in this case the whole burden of the discordance falls on the 
observations at arc length 4. The special examples selected are of no real import- 
ance; the author probably laid no stress on his inscribed curves, and a little better 
draughtsmanship might have bettered them to some extent. They are used simply 
to illustrate that a new instrument is ready to the hand of the physicist. He must 
have felt very desirous at times in the past of being certain how far his observations 
were in accord with theory. How many times must he not have put to himself the
question: What is the probability that my theory describes adequately my facts?
Meanwhile the statistician who examines the physicist’s diagrams has often been 
forced to smile by the degree of discordance which the physicist has allowed to 
pass as if some graph could demonstrate adequately the harmony of physical law 
and experiment. I look forward to the time when no physical paper will be 
considered complete unless it provides at the end of each series of experiments the 
value of P, i.e. the measure of the goodness of fit of observations recorded to 
theory adopted. There will be no excuse, there really is no excuse, now that 
tables are provided, for its omission. It is always possible in the course of an 
hour or so’s arithmetic to measure the accordance between supposed law and 
recorded observation. 


One of the important steps in the work given is undoubtedly the measurement 
by means of the correlation ratio of the mean square error of the physicist’s 
determinations. I think that is undoubtedly the best way of finding it. The 
physicist is apt to use “mean errors” instead of mean square errors. He does not 
recognise that the probable error of a mean error is sensibly greater than that 
of a mean square error. But he may wish in the present case to have some test 
of the accuracy of the determination of $\sigma_{\bar n_p}$, the standard deviation of an array,
from other processes. One which it appears to me ought to be satisfactory is the
following: Let the observations of A for a given value of B be taken in pairs,
and let all these pairs of A values be formed and their differences $x_1 - x_2$ be taken
in each case, $x_1$ being greater than $x_2$, then the following relation should hold if
the distribution of errors be Gaussian:

$$\sigma^2_{\bar n_p} = \frac{\text{Mean}\,(x_1 - x_2)^2 - (\text{Mean}\,(x_1 - x_2))^2}{.72676}.$$

For Professor Duffield’s 4-ampère curve this gives
$\sigma^2_{\bar n_p} = .2444$ against $\sigma_A^2 (1 - \eta^2) = .0977$,
and for his 2-ampère curve
$\sigma^2_{\bar n_p} = .1987$ against $\sigma_A^2 (1 - \eta^2) = .2274$.

The latter, considering that we deal only with 13 pairs, is fairly accordant; the
agreement in the case of the former is very poor, but in this case there are only
six pairs. I have purposely introduced this case, because no real verification of
the $\eta$ method could be obtained on the basis of six pairs only, and yet for the
4-ampére curve the whole question of whether the graph is a reasonable description 
of the data actually depends on the existence of six paired observations in the 
total of 17. We are really left without adequate material to determine effectively 
the probable error of any observation. This may be of no importance in the 
present case, but the absence of adequate repetition of the value of A for a given 
value of B in order to determine the probable error of observation and so the 
“goodness of fit” of observation to theory is characteristic of much current physical 
research. 
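The check by paired differences can be sketched in a few lines. The divisor .72676 is $2(1 - 2/\pi)$, the variance of the absolute difference of two independent Gaussian errors in units of the array variance. The function name is hypothetical, and the six pairs are the paired weights as read from the 4-ampère table, so that the result rests on that reading.

```python
import math

# Variance of |x1 - x2| for two independent Gaussian errors, in units of sigma^2.
HALF_NORMAL_FACTOR = 2.0 * (1.0 - 2.0 / math.pi)   # 0.72676...

def array_variance_from_pairs(pairs):
    """sigma^2 = [Mean(d^2) - (Mean d)^2] / 0.72676, with d = |x1 - x2| for each pair."""
    d = [abs(x1 - x2) for x1, x2 in pairs]
    mean_d = sum(d) / len(d)
    mean_d2 = sum(v * v for v in d) / len(d)
    return (mean_d2 - mean_d ** 2) / HALF_NORMAL_FACTOR

# Paired observations of the 4-ampere curve (arc lengths with two measurements each).
pairs_4_ampere = [(18.30, 17.92), (19.62, 19.19), (20.32, 19.70),
                  (22.63, 22.40), (22.89, 21.37), (23.07, 22.55)]
sigma2_pairs = array_variance_from_pairs(pairs_4_ampere)
```

With so few pairs the estimate is dominated by a single large difference, which is one way of seeing why six pairs cannot verify the $\eta$ method.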

I am much indebted to Mr Andrew W. Young for algebraical and arithmetical 
aid. 











ON THE ‘BEST’ VALUES OF THE CONSTANTS 
IN FREQUENCY DISTRIBUTIONS. 


By KIRSTINE SMITH. 


(1) If we attempt to fit the normal or Gaussian curve to a system of observations,
we almost invariably determine the constants $\bar x$ and $\sigma$ of the equation

$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\frac{(x - \bar x)^2}{\sigma^2}}$$

by the method of moments. This method of moments has been extended by Thiele, 
Pearson, Lipps and others to obtain the constants involved in various skew 
frequency curves and series. It is an undoubtedly utile and accurate method; 
but the question of whether it gives the ‘best’ values of the constants has not been 
very fully studied. It is perfectly true that if we deal with individual observations 
then the method of moments gives, with a somewhat arbitrary definition of what 
is to be a maximum, the ‘best’ values for $\sigma$ and $\bar x$ in the above equation to the
Gaussian. Pearson* has shown that the method of moments agrees with the 
method of least squares in the case where the distribution is given by a high 
order parabola, and accordingly the method of moments is likely to give a very 
good result, when an expansion by Maclaurin’s Theorem would closely give a 
frequency function. But the method of least squares itself can now-a-days hardly 
be spoken of as more than a utile and accurate method of fit, indeed its utility, 
owing to the cumbersome nature of the equations which frequently arise, is often 
far less than that of the method of moments. 


Gauss’ original proof that the probability of the observed individual results
was a maximum when $\bar x$ and $\sigma$ have been determined by moments has led to the
extension of the conception that for grouped data, and for other results than the
Gaussian curve, the ‘best’ values of the constants must be given by the lowest
possible moments. This is of course not true. For example, if we had as frequency
curve

$$y = y_0\, e^{-\frac{1}{4}\frac{(x - \bar x)^4}{\sigma^4}},$$

and used individual observations, then the Gaussian ‘best’ value for $\bar x$ would be
that found by determining the point for which the third moment coefficient


* Biometrika, Vol. 1. pp. 267-70. 
vanished, and the ‘best’ value of $\sigma$ would be determined by $\sigma = \sqrt[4]{\mu_4}$, where $\mu_4$ is
to be taken about the point for which $\mu_3 = 0$*.


From another standpoint, however, the ‘best values’ of the frequency constants
may be said to be those for which

$$\chi^2 = S\left\{\frac{(n_s - \tilde n_s)^2}{\tilde n_s}\right\}$$

is a minimum, where $n_s$ is the observed frequency and $\tilde n_s$ the theoretical frequency
of the sth group. For when $\chi^2$ is a minimum then P, the probability of occurrence
of a result as divergent as or more divergent than the observed, will be a maximum,
or the frequency constants will have been so chosen as to make the probability
P of results, as divergent from theory as the observed data occurring, a maximum.


It sounds somewhat paradoxical, but it is none the less true to say that the
‘best value’ of the mean is not necessarily the mean value, nor the ‘best value’ of
the mean square deviation necessarily the mean square deviation‡. I shall
illustrate this in the following five cases:


I. Fit of a normal curve to unilateral data. 
II. Fit of a normal curve to bilateral data. 
III. Fit of a Poisson limit to the binomial. 
IV. Fit of a binomial to binomial data. 

V. Fit of regression lines. 


The general method is as follows. Suppose f to be any independent frequency
constant; then $\chi^2$ is to be a minimum with the variation of f. Accordingly we have
from

$$N + \chi^2 = S\left(\frac{n_s^2}{\tilde n_s}\right)$$


* University of London, Honours B.Sc., Papers in Statistics, Thursday, Oct. 28, 1915.

† Phil. Mag. Vol. L. p. 157, 1900.

‡ There is a point of some philosophical interest here which deserves further consideration. As is
well known the Gaussian demonstration depends on making the product

$$P = \Pi\left\{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\frac{(x_s - \bar x)^2}{\sigma^2}}\right\},$$

s being taken so as to include each individual observation, a maximum by varying $\sigma$ and $\bar x$, the result
being that the ‘best’ values are found from the first two moments. Now it will be observed that this
is not the same idea as lies in the $\chi^2$ test of goodness of fit. The conception of ‘goodness’ in that case
is that we should measure the probability of a drawing from a certain population giving as divergent
or a more divergent result than that observed. In other words while the Gaussian test makes a single
ordinate of a generalised frequency surface a maximum, the $\chi^2$ test makes a real probability, namely
the whole volume lying outside a certain contour surface defined by $\chi^2$, a maximum. Logically this seems
the more reasonable, for the above product used in the Gaussian proof is not a probability at all. To
make it a probability it must be multiplied by the product $\Pi\{\delta x_s\}$, and then the probability of the actually
observed result, namely $x_1, x_2, \ldots x_s, \ldots x_n$, will of course be infinitely small, and what is made a maximum
is an infinitely small probability. The exact meaning of $P\,\Pi\{\delta x_s\}$ when $x_s$ is an actual observation is
obscure, but it appears that the probability for constant indefinitely small ranges of the variates in the
neighbourhood of the observed values is made a maximum. But probability means the frequency of
recurrence in a repeated series of trials and this probability is in the case supposed indefinitely small.
It seems far more reasonable to make a finite probability, i.e. the probability of a divergence as great or
greater than the observed a maximum, i.e. to use the $\chi^2$ test and not the Gaussian principle.
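a number of equations of type

$$S\left(\frac{n_s^2}{\tilde n_s^2}\,\frac{d\tilde n_s}{df}\right) = 0 \quad \ldots\ldots\ldots (1).$$

These equations will generally be far too involved to be directly solved. Accordingly
we proceed thus: We suppose that the values of the frequency constants
given by the method of moments are good starting-points, and we put, if $\bar f$ denote
the moment value of a frequency constant, $f = \bar f + \Delta f$. Accordingly if there be
a number $f_1, f_2, \ldots f_q$ of independent frequency constants, we shall have a series
of equations to find $\Delta f_1, \Delta f_2, \ldots \Delta f_q$ of the type

$$0 = S\left(\frac{n_s^2}{\tilde n_s^2}\left[\frac{d\tilde n_s}{df_1}\right]\right) + \Delta f_1\, S\left(\frac{n_s^2}{\tilde n_s^2}\left[\frac{d^2\tilde n_s}{df_1^2}\right] - \frac{2n_s^2}{\tilde n_s^3}\left[\frac{d\tilde n_s}{df_1}\right]^2\right) + \Delta f_2\, S\left(\frac{n_s^2}{\tilde n_s^2}\left[\frac{d^2\tilde n_s}{df_1\,df_2}\right] - \frac{2n_s^2}{\tilde n_s^3}\left[\frac{d\tilde n_s}{df_1}\right]\left[\frac{d\tilde n_s}{df_2}\right]\right) + \ldots + \Delta f_q\, S\left(\frac{n_s^2}{\tilde n_s^2}\left[\frac{d^2\tilde n_s}{df_1\,df_q}\right] - \frac{2n_s^2}{\tilde n_s^3}\left[\frac{d\tilde n_s}{df_1}\right]\left[\frac{d\tilde n_s}{df_q}\right]\right) \quad \ldots (2),$$

where a square bracket round the differential coefficients signifies that the frequency
constants $f_1, f_2, \ldots f_q$ therein are to be given their moment values $\bar f_1, \bar f_2, \ldots \bar f_q$. These
values are of course also to be used in $\tilde n_s$.

Since $S(\tilde n_s) = N$, it is clear that

$$S\left(\frac{d\tilde n_s}{df}\right) = S\left(\frac{d^2\tilde n_s}{df^2}\right) = S\left(\frac{d^2\tilde n_s}{df\,df'}\right) = 0.$$

Accordingly the above equations may be reduced to the type

$$0 = S\left(\frac{n_s^2 - \tilde n_s^2}{\tilde n_s^2}\left[\frac{d\tilde n_s}{df_1}\right]\right) + \Delta f_1\, S\left(\frac{n_s^2 - \tilde n_s^2}{\tilde n_s^2}\left[\frac{d^2\tilde n_s}{df_1^2}\right] - \frac{2n_s^2}{\tilde n_s^3}\left[\frac{d\tilde n_s}{df_1}\right]^2\right) + \Delta f_2\, S\left(\frac{n_s^2 - \tilde n_s^2}{\tilde n_s^2}\left[\frac{d^2\tilde n_s}{df_1\,df_2}\right] - \frac{2n_s^2}{\tilde n_s^3}\left[\frac{d\tilde n_s}{df_1}\right]\left[\frac{d\tilde n_s}{df_2}\right]\right) + \ldots \quad \ldots (2a).$$

It might be anticipated that terms involving the product of $\Delta f$
and $(n_s^2 - \tilde n_s^2)/\tilde n_s^2$ could be neglected in the first place and accordingly that we
should have as approximate type

$$S\left(\frac{n_s^2}{\tilde n_s^2}\left[\frac{d\tilde n_s}{df_1}\right]\right) = 2\,\Delta f_1\, S\left(\frac{n_s^2}{\tilde n_s^3}\left[\frac{d\tilde n_s}{df_1}\right]^2\right) + 2\,\Delta f_2\, S\left(\frac{n_s^2}{\tilde n_s^3}\left[\frac{d\tilde n_s}{df_1}\right]\left[\frac{d\tilde n_s}{df_2}\right]\right) + \ldots \quad \ldots (2b),$$

but this approximation has not in every case numerically justified itself, and thus
it cannot be invariably used as more than a reasonable starting-off point.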


(2) Fit of a Normal Curve. 











Differentiating 


ae 
and then putting
N —1(%.-m)* 
a= Fane 2° o and z,/o=h,, 


ih . | 
+ Am (825 asrtoen — helt + 2078 (25 [2a ~ 2) 
2 2 
a Ac (s {es {(- 24 t 2s W243 — hee} + 2NS 1m [Zs41 = Zs] [hss1%st1 ‘cy healt) , 
Sais "  % 
2 
0=a8 1 [hs1%s41 <y healt 


2 2 
+ Am (s ey [- Zs44 + % + W424 — hex} +2NS8 1 [Zs41 = 2s] [hsi12s41 a helt) 





/ 2 ) 2 
+ Ao (s ee ((A3,412s41 v2 hSz,) ‘i 2 (hsir%e41 ay haz)I a 2NS 1m [hess e541 heat) 


the differential coefficients of x? being 





d(x?) N .(n,? 
; i te 1 (2541 — a} we 
) ON (et. tte ; 
and d iy a S S {Ps (h, +1%s+1 os heed} 


Illustration I. Fit of Normal Curve to Unilateral Data. 


Our first illustration treats a series of measurements by Bradley discussed 
by Bessel*. The mean of the observations is fixed, for in dealing with the observa- 
tions Bessel has added positive and negative variations together. 


TABLE I. Measurements of Right Ascension.

Limits      | Observed | Gaussian curve by moments | Gaussian curve improved by minimum $\chi^2$
0″.0-0″.1   | 114      | 101.61                    | 98.63
0″.1-0″.2   | 84       | 84.12                     | 82.59
0″.2-0″.3   | 53       | 57.65                     | 57.91
0″.3-0″.4   | 24       | 32.71                     | 34.00
0″.4-0″.5   | 14       | 15.36                     | 16.72
0″.5-0″.6   | 6        | 5.974                     | 6.881
0″.6-0″.7   | 3        | 1.923                     | 2.372
0″.7-0″.8   | 1        | .5122                     | .6843
0″.8-0″.9   | 1        | .1370                     | .2053











* Emanuel Czuber, Theorie der Beobachtungsfehler, p. 192. Search has been made in vain in the 
Fundamenta Astronomiae for the original data in order to remove the unilateral limitation. 
$\bar\sigma$ was found equal to 2.282542 and the second formula of (3) gave the value
2.341735 for $\sigma$. As $\Delta\sigma$ was found so large that the approximation could not be
expected to be very good, the following values of $d(\chi^2)/d\sigma$ were calculated from the
second formula of (4):

$\sigma$             | $d(\chi^2)/d\sigma$
2.282542             | -32.53
1/.43 = 2.325581     | -11.81
1/.42 = 2.380952     | +8.06

By interpolation in this table $\sigma = 2.355860$ was found as the value for which
$d(\chi^2)/d\sigma$ equals zero, and this is the $\sigma$ of the improved Gaussian given above.
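The interpolation may be illustrated as follows. The text does not say which interpolation formula was used; the sketch below assumes a parabola through the three tabulated points, written in divided differences, and locates its zero by bisection. The function name is hypothetical.

```python
def quadratic_root(points, lo, hi):
    """Zero in [lo, hi] of the parabola through three (x, y) points."""
    (x0, y0), (x1, y1), (x2, y2) = points
    d01 = (y1 - y0) / (x1 - x0)              # first divided differences
    d12 = (y2 - y1) / (x2 - x1)
    d012 = (d12 - d01) / (x2 - x0)           # second divided difference

    def f(x):
        return y0 + d01 * (x - x0) + d012 * (x - x0) * (x - x1)

    for _ in range(60):                      # bisection on the interpolating parabola
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Tabulated values of sigma and d(chi^2)/d(sigma).
table = [(2.282542, -32.53), (2.325581, -11.81), (2.380952, 8.06)]
sigma_improved = quadratic_root(table, 2.325581, 2.380952)
```

Under this assumption the zero falls very close to the value 2.355860 quoted above.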


From $\chi^2$ the ‘goodness of fit’ P was found:

               | $\chi^2$ | P     | $d(\chi^2)/d\sigma$
Gaussian       | 10.833   | 0.211 | -32.53
Impr. Gaussian | 9.720    | 0.285 | 0.20

As will be seen the better fit is obtained by making $\sigma$ bigger than the Gaussian
value, the improvement therefore cannot be looked upon as a correction for grouping.
On the contrary the Sheppard correction would have given $\sigma = 2.264214$ and
have raised $\chi^2$ to 11.52. Thus we see that although the two methods give close
values for P, the ‘better value’ is obtained as it should be from the lesser value
of $d(\chi^2)/d\sigma$.
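The values of $\chi^2$ and P for the two fits can be recomputed from the frequencies of Table I. The sketch assumes that the nine groups of the table are the only ones entering $\chi^2$ and that P is taken with eight degrees of freedom; both assumptions are inferred from the figures and are not stated in the text.

```python
import math

observed   = [114, 84, 53, 24, 14, 6, 3, 1, 1]
by_moments = [101.61, 84.12, 57.65, 32.71, 15.36, 5.974, 1.923, 0.5122, 0.1370]
improved   = [98.63, 82.59, 57.91, 34.00, 16.72, 6.881, 2.372, 0.6843, 0.2053]

def chi_square(obs, theory):
    """chi^2 = S{ (n_s - nt_s)^2 / nt_s } over the groups."""
    return sum((n - t) ** 2 / t for n, t in zip(obs, theory))

def upper_tail_even_df(chi2, df):
    """P(X >= chi2) for even df: e^(-x) * sum_{k < df/2} x^k / k!, with x = chi2 / 2."""
    x = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= x / k
        total += term
    return math.exp(-x) * total

chi2_moments = chi_square(observed, by_moments)
chi2_improved = chi_square(observed, improved)
P_moments = upper_tail_even_df(chi2_moments, 8)
P_improved = upper_tail_even_df(chi2_improved, 8)
```

The recomputed figures agree with the tabulated ones to within the rounding of the printed frequencies.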

(3) Illustration II. Fit of a Normal Curve to Bilateral Data. 

For the next illustration I have used a table giving frequencies of cephalic
index in Bavarian skulls*. Both $\sigma$ and m have here been varied. As the formulae
(3) are somewhat laborious to work with, the approximations were used roughly
suggested by the process on p. 264, but the results were not satisfactory† and these
approximate results are therefore not given here. But $d(\chi^2)/d\sigma$ and $d(\chi^2)/dm$ for the two


* J. Ranke, Beiträge zur physischen Antropologie der Baiern, München, 1883. The table includes
the material from Tables I-VI and VIII-X inclusive which may be treated as typically ‘Alt-Baierisch.’
d(x?) 

do* 


t In fact the calculation of the exact value of showed that the part of it neglected in 


formula (26) was about ,', of the whole value. It essentially arose from the one tail group, this being 
a n,2 - 
s 


n,? : A . 
382 of the whole neglected part. As ;—— for this group was only as big as 10348, the approximate 
ne = ‘5 


formula (2) cannot be expected to be of great value for the normal curve. 
Gaussians found in this way were used for interpolation purposes and their constants
are therefore given in the following table under (b) and (c). By (a) is indicated
the Gaussian from which we started, namely that found by moments, Sheppard’s
correction being used.

Assuming $d(\chi^2)/d\sigma$ and $d(\chi^2)/dm$ to be linear functions of $\sigma$ and m, we determined
from the cases (a), (b) and (c) values of $\sigma$ and m, given under (d), so as to make
the differential coefficients zero. In the same way we found at last from the cases
(a), (b) and (d) the constants of the Gaussian (e), the constants of which will be
found in the following table. As will be seen we have succeeded in bringing the
values of $d(\chi^2)/d\sigma$ and $d(\chi^2)/dm$ near to zero, certainly close enough for all practical
purposes.

TABLE II.

    | m        | $\sigma$ | $\chi^2$ | P    | $d(\chi^2)/dm$ | $d(\chi^2)/d\sigma$
(a) | 83.06889 | 3.431833 | 10.205   | .895 | -.67           | +14.42
(b) | 83.01498 | 3.358380 | 10.301   | .891 | -10.56         | -9.97
(c) | 82.98832 | 3.331365 | 11.048   | .854 | -15.89         | -20.76
(d) | 83.05329 | 3.349421 | 10.108   | .899 | -4.59          | -12.10
(e) | 83.07774 | 3.385991 | 9.858    | .909 | [illegible]    | +.71
TABLE III.

Cephalic index | Observed | Gaussian curve by moments | Gaussian improved by minimum $\chi^2$
75 and under   | 9.5      | 12.3387                   | 11.3504
76             | 12.5     | 12.6842                   | 12.0767
77             | 17       | 22.0702                   | 21.3463
78             | 37       | 35.2942                   | 34.6005
79             | 55       | 51.8794                   | 51.4323
80             | 71.5     | 70.0925                   | 70.1100
81             | 82       | 87.0421                   | 87.6432
82             | 116      | 99.3519                   | 100.4734
83             | 98       | 104.2329                  | 105.6275
84             | 107      | 100.5128                  | 101.8352
85             | 82       | 89.0879                   | 90.0352
86             | 74       | 72.5781                   | 72.9998
87             | 58       | 54.3468                   | 54.2778
88             | 34.5     | 37.4049                   | 37.0099
89             | 19       | 23.6625                   | 23.1422
90             | 10       | 13.7588                   | 13.2703
91             | 8        | 7.3532                    | 6.9782
92 and over    | 9        | 6.3093                    | 5.7910





(4) Fit of a Poisson Limit to the Binomial. For a Poisson limit with the
general term $e^{-m} m^s/s!$ we find

$$\frac{d(\chi^2)}{dm} = S\left\{\frac{n_s^2}{\tilde n_s}\,\frac{m - s}{m}\right\} \quad \ldots\ldots\ldots (5),$$

and putting $m = \bar m + \Delta m$,

$$\Delta m = \frac{\bar m\; S\left\{\frac{n_s^2}{\tilde n_s}\,(s - \bar m)\right\}}{S\left\{\frac{n_s^2}{\tilde n_s}\left[(s - \bar m)^2 + s\right]\right\}} \quad \ldots\ldots\ldots (6).$$

Starting with $\bar m$ equal to the mean of the observations I have found the
improved values in the following two numerical examples.


Illustration III. The first table given by L. Whitaker* contains the number
of deaths per day of women over 85 years, published in the Times newspaper
during the years 1910-1912.

TABLE IV.

Number of deaths per day | Observed | Poisson by first moment | Poisson improved by minimum $\chi^2$
0                        | 364      | 336.250                 | 331.133
1                        | 376      | 397.302                 | 396.334
2                        | 218      | 234.720                 | 237.186
3                        | 89       | 92.446                  | 94.630
4                        | 33       | 27.308                  | 28.316
5                        | 13       | 6.4532                  | 6.7782
6                        | 2        | 1.2708                  | 1.3521
7                        | 1        | 0.2508                  | 0.2715

The m, $\chi^2$, P and $d(\chi^2)/dm$ calculated from (5) were determined for the two
distributions as given in Table V.

TABLE V.

                 | m        | $\chi^2$ | P     | $d(\chi^2)/dm$
Poisson          | 1.181569 | 15.226   | .0332 | -35.51
Poisson improved | 1.196903 | 14.943   | .0361 | 0.75








Illustration IV. As our second illustration we have taken a table of phagocytic
frequencies published by Major McKendrick†.

TABLE VI.

No. of Deaths | Observed | Poisson by first moment | Poisson improved by minimum $\chi^2$
0             | 620      | 605.924                 | 600.676
1             | 282      | 303.568                 | 306.164
2             | 79       | 76.044                  | 78.026
3             | 16       | 12.699                  | 13.257
4             | 2        | 1.5906                  | 1.6892
5             | 1        | .1738                   | .1881











* Biometrika, Vol. x. p. 67. 








+ Proceedings of the London Mathematical Society, Vol. x11. 1913, p. 401. 
The numerical values of the constants of the series and of the ‘goodness of
fit’ are

TABLE VII.

                 | m       | $\chi^2$ | P    | $d(\chi^2)/dm$
Poisson          | .501000 | 6.865    | .231 | -41.86
Poisson improved | .509700 | 6.672    | .246 | -1.21

This table is of interest because it illustrates the apparent paradox, already seen
in the case of the second Gaussian curve illustration, that the ‘mean’ is not
necessarily the ‘best value’ of the constant termed the ‘mean.’
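That the first-moment value of m does not make $\chi^2$ a minimum can be checked numerically on the frequencies of Table VI. The sketch below does not use the correction formula of the text; it minimises $\chi^2$ directly by a golden-section search, which is an assumption of convenience. The last class is treated as “5 and over”, an assumption inferred from the tabulated Poisson frequency .1738, and the function names are hypothetical.

```python
import math

observed = [620, 282, 79, 16, 2, 1]        # Table VI, classes 0, 1, 2, 3, 4, 5 and over
N = sum(observed)

def poisson_freq(m):
    """Theoretical frequencies; the last class receives the whole remaining tail."""
    head = [N * math.exp(-m) * m ** s / math.factorial(s)
            for s in range(len(observed) - 1)]
    return head + [N - sum(head)]

def chi_square(m):
    return sum((n - t) ** 2 / t for n, t in zip(observed, poisson_freq(m)))

def golden_section_min(f, lo, hi, tol=1e-9):
    """Minimiser of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                      # minimum lies in [a, old d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                      # minimum lies in [old c, b]
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

m_moment = sum(s * n for s, n in enumerate(observed)) / N   # 0.501
m_best = golden_section_min(chi_square, 0.3, 0.8)
```

The search places the minimum a little above the improved value .509700 of Table VII, where the tabulated differential coefficient is still slightly negative.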


(5) Fit of a Binomial to Binomial Data. 


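Let $\tilde n_s$ be equal to the $(s + 1)$th term of the binomial $N(p + q)^l$, where
$p + q = 1$, or to

$$N\,p^{l-s}\,(1 - p)^s\,\frac{l\,(l - 1) \ldots (l - s + 1)}{s!};$$

we then find

$$\frac{d\tilde n_s}{dp} = \tilde n_s\,\frac{l - pl - s}{p\,(1 - p)} = \tilde n_s\,\frac{m - s}{p\,(1 - p)},$$

where m is the mean or stands for $l\,(1 - p)$,

$$\frac{d^2\tilde n_s}{dp^2} = \frac{\tilde n_s}{p^2 (1 - p)^2}\left\{(l - pl - s)^2 - (l - pl - s)(1 - 2p) - lp\,(1 - p)\right\} = \frac{\tilde n_s}{p^2 (1 - p)^2}\left\{(m - s)^2 - (m - s)(1 - 2p) - mp\right\},$$

$$\frac{d\tilde n_s}{dl} = \tilde n_s\left(\log_e p + \frac{1}{l} + \frac{1}{l - 1} + \ldots + \frac{1}{l - s + 1}\right),$$

$$\frac{d^2\tilde n_s}{dl^2} = \tilde n_s\left\{\left(\log_e p + \frac{1}{l} + \frac{1}{l - 1} + \ldots + \frac{1}{l - s + 1}\right)^2 - \left(\frac{1}{l^2} + \frac{1}{(l - 1)^2} + \ldots + \frac{1}{(l - s + 1)^2}\right)\right\},$$

$$\frac{d^2\tilde n_s}{dl\,dp} = \frac{\tilde n_s}{p\,(1 - p)}\left\{(m - s)\left(\log_e p + \frac{1}{l} + \frac{1}{l - 1} + \ldots + \frac{1}{l - s + 1}\right) + (1 - p)\right\},$$

and the equations (2a) take the form

$$S\left(\frac{n_s^2}{\tilde n_s}\,\frac{m - s}{p\,(1 - p)}\right) = S\left(\frac{n_s^2}{\tilde n_s\,p^2 (1 - p)^2}\left[(m - s)^2 + (m - s)(1 - 2p) + mp\right]\right)\Delta p + S\left(\frac{n_s^2}{\tilde n_s\,p\,(1 - p)}\left[(m - s)\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right) - (1 - p)\right]\right)\Delta l,$$

$$S\left(\frac{n_s^2}{\tilde n_s}\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right)\right) = S\left(\frac{n_s^2}{\tilde n_s\,p\,(1 - p)}\left[(m - s)\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right) - (1 - p)\right]\right)\Delta p + S\left(\frac{n_s^2}{\tilde n_s}\left[\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right)^2 + \frac{1}{l^2} + \frac{1}{(l - 1)^2} + \ldots + \frac{1}{(l - s + 1)^2}\right]\right)\Delta l \quad \ldots (8),$$

while the approximate formulae of the type (2b) are

$$S\left(\frac{n_s^2}{\tilde n_s}\,\frac{m - s}{p\,(1 - p)}\right) = 2S\left(\frac{n_s^2}{\tilde n_s\,p^2 (1 - p)^2}\,(m - s)^2\right)\Delta p + 2S\left(\frac{n_s^2}{\tilde n_s\,p\,(1 - p)}\,(m - s)\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right)\right)\Delta l,$$

$$S\left(\frac{n_s^2}{\tilde n_s}\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right)\right) = 2S\left(\frac{n_s^2}{\tilde n_s\,p\,(1 - p)}\,(m - s)\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right)\right)\Delta p + 2S\left(\frac{n_s^2}{\tilde n_s}\left(\log_e p + \frac{1}{l} + \ldots + \frac{1}{l - s + 1}\right)^2\right)\Delta l \quad \ldots (9).$$

Illustration V. Weldon’s Dice Data.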
Illustration V. Weldon’s Dice Data. 


For illustration are used the following data due to the late Professor W. F. R.
Weldon*. They give the observed frequencies of dice with five or six points when
a throw of twelve dice was made 26306 times.

TABLE VIII.

Number of dice in cast with 5 or 6 points | Observed frequency | Binomial by method of moments | Improved binomial (a) by $\chi^2$ a minimum | Improved binomial (b) by $\chi^2$ a minimum
0  | 185  | 189.679  | 190.651  | 190.659
1  | 1149 | 1154.441 | 1157.607 | 1157.600
2  | 3265 | 3223.426 | 3226.085 | 3225.959
3  | 5475 | 5461.01  | 5458.07  | 5457.78
4  | 6114 | 6253.64  | 6245.98  | 6245.71
5  | 5194 | 5101.31  | 5095.82  | 5095.79
6  | 3067 | 3041.04  | 3041.47  | 3041.69
7  | 1331 | 1335.82  | 1339.55  | 1339.81
8  | 403  | 429.627  | 432.815  | 432.984
9  | 105  | 98.865   | 100.351  | 100.419
10 | 14   | 15.5133  | 15.9413  | 15.9595
11 | 4    | 1.57640  | 1.65879  | 1.66210











* Phil. Mag. July, 1900, p. 167. 
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Fitting the frequencies from the end by means of two moments we obtain the 
binomial 
(°6658208 + -3341792)12-126379, 
the terms of which are given in the table above under the head Binomial. 


From these starting values of p and / we found by the equations (8) the constants 
of the improved binomial (a) p = -6674922 and J = 12-188945. 


A comparison between the coefficients of the two sets of formulae (8) and (9)
gave the result that they only diverged by between 1.4 and 5 per mille of their
value. As $\frac{n_s'^2 - n_s^2}{n_s^2}$ for the tail group was as big as 5.44, we are from this justified
in expecting the approximate formulae (9) to be useful for binomial data.


Starting from the improved binomial (a) another improved binomial (b) was
found by means of the formulae (9). As will be seen I only succeeded in
diminishing $\frac{d(\chi^2)}{dp}$ by raising $\frac{d(\chi^2)}{dl}$, and χ² came out with exactly the same value
as by the former formula. The constants for the improved binomial (b) are
p = .6675432 and l = 12.191141.


The constants illustrating the ‘goodness of fit’ were found as follows: 











TABLE IX.

 | χ² | P | d(χ²)/dp | d(χ²)/dl
Binomial | 11.643 | .390 | 159.47 | −8.15
Improved Binomial (a) | 11.513 | .401 | 26.02 | −.84
Improved Binomial (b) | 11.513 | .401 | −.02 | +1.96








It will be seen from the above illustrations that the probability of happening,
as determined by the χ² test of ‘goodness of fit,’ can, by being made a maximum,
always be made somewhat greater than the same probability deduced from a fit by the
method of moments, which at any rate for the Gaussian curve is usually assumed
to be the ‘best.’
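
The χ² and P of Table IX can be checked from the columns of Table VIII. The sketch below assumes that P is the upper tail of the χ² distribution with one degree of freedom fewer than the number of groups (11 for the twelve dice groups), which reproduces the printed P for the improved binomial (a). The series used for the tail is the standard closed form for integer degrees of freedom, and the function names are illustrative.

```python
import math

observed = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4]
moments = [189.679, 1154.441, 3223.426, 5461.01, 6253.64, 5101.31,
           3041.04, 1335.82, 429.627, 98.865, 15.5133, 1.57640]
improved_a = [190.651, 1157.607, 3226.085, 5458.07, 6245.98, 5095.82,
              3041.47, 1339.55, 432.815, 100.351, 15.9413, 1.65879]


def chi_square(obs, theo):
    """Sum of (observed - theoretical)^2 / theoretical over the groups."""
    return sum((o - t) ** 2 / t for o, t in zip(obs, theo))


def chi_square_p(x, dof):
    """Upper-tail probability of chi-square value x for integer dof >= 1."""
    if dof % 2 == 0:
        # even dof: exp(-x/2) * sum_{j < dof/2} (x/2)^j / j!
        total, term = 0.0, 1.0
        for j in range(dof // 2):
            total += term
            term *= (x / 2) / (j + 1)
        return math.exp(-x / 2) * total
    # odd dof: normal tail plus a finite series in odd powers of sqrt(x)
    total, term = 0.0, math.sqrt(x)
    for j in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total)


chi2_moments = chi_square(observed, moments)
chi2_a = chi_square(observed, improved_a)
p_a = chi_square_p(chi2_a, 11)
print(round(chi2_moments, 3), round(chi2_a, 3), round(p_a, 3))
```

The value for the moment binomial obtained this way from the rounded tabular entries may differ from the printed 11.643 in the second decimal place.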

(6) On the ‘Best’ Values of the Constants of Regression Curves.

If we apply the test of ‘goodness of fit’ to regression curves as recently indicated
by Pearson* modifying Slutsky’s† methods, we shall experience the same divergence
between the curves of regression found by the method of least squares and the
curves calculated so as to make χ² a minimum, as we found when dealing with
frequency distributions.

In the paper cited χ² for a regression curve is given as

$$\chi^2 = S\left\{\frac{\bar n_p\,(\bar m_p - m_p)^2}{\sigma_{n_p}^2}\right\} \qquad\dots\dots(10),$$


* Biometrika, Vol. XI. pp. 239 et seq.
† Journal of the Royal Statistical Society, Vol. LXXVII. pp. 78-84.
where $\bar m_p$ is the mean of the pth array of the sample of size M from a population
of size N, while $m_p$ is the theoretical mean as found from the regression curve,
$\bar n_p = n_p M/N$ is the mean frequency and $\sigma_{n_p}$ the mean standard deviation of the
pth array in the samples. The difficulty in applying the ‘goodness of fit’ test
lies in finding adequate values for $\bar n_p$ and $\sigma_{n_p}$. Let us assume them to be found.
The ‘best’ values of the constants $f_1, f_2, \dots$ of the regression curve, i.e. the values
which make χ² a minimum, will then be found from equations of the type

$$0 = -2S\left\{\frac{\bar n_p}{\sigma_{n_p}^2}(\bar m_p - m_p)\frac{\partial m_p}{\partial f}\right\} + S\left\{(\bar m_p - m_p)^2\,\frac{\partial}{\partial f}\left(\frac{\bar n_p}{\sigma_{n_p}^2}\right)\right\} \qquad\dots\dots(11).$$


As will be seen these equations fall into the equations resulting from using
the method of least squares if $\bar n_p/\sigma_{n_p}^2$ is independent of the constants of the regression
curve and at the same time for the different arrays proportional to the $n_p$ of the
sample. Even if our sample be derived from truly Gaussian data, these conditions
will only approximately be satisfied, the $\sigma_{n_p}$, although constant, being dependent
upon the constants of the regression curve and the $\bar n_p$ of the formula not being
really the sample value.


Supposing $\bar n_p/\sigma_{n_p}^2$ to be independent of the constants of the regression line
$m_p = ax_p + b$, the equations (11) take the form
$$S\{v_p(\bar m_p - ax_p - b)\,x_p\} = 0,$$
$$S\{v_p(\bar m_p - ax_p - b)\} = 0,$$
when we put $v_p$ for $\bar n_p/\sigma_{n_p}^2$.

From these equations we find
$$a = \frac{S(v_p \bar m_p x_p)\,S(v_p) - S(v_p \bar m_p)\,S(v_p x_p)}{S(v_p x_p^2)\,S(v_p) - \{S(v_p x_p)\}^2},$$
$$b = \frac{S(v_p \bar m_p)}{S(v_p)} - a\,\frac{S(v_p x_p)}{S(v_p)},$$
formulae agreeing with those derived from the method of least squares if $v_p$ equals
the marginal frequencies of the sample, but not agreeing with them if, for example,
the material be heteroscedastic.
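
The two formulae for a and b amount to a weighted straight-line fit to the array means. A minimal sketch follows; the function name `weighted_line` and the small data set in the usage lines are invented for illustration and are not taken from the paper.

```python
def weighted_line(x, m, v):
    """Slope a and intercept b minimising S{ v_p (m_p - a x_p - b)^2 }."""
    sv = sum(v)
    svx = sum(vi * xi for vi, xi in zip(v, x))
    svm = sum(vi * mi for vi, mi in zip(v, m))
    svxx = sum(vi * xi * xi for vi, xi in zip(v, x))
    svmx = sum(vi * mi * xi for vi, mi, xi in zip(v, m, x))
    a = (svmx * sv - svm * svx) / (svxx * sv - svx ** 2)
    b = svm / sv - a * svx / sv
    return a, b


# Hypothetical array means lying exactly on m = 2x + 1, with unequal weights.
x = [1.0, 2.0, 3.0, 4.0]
m = [3.0, 5.0, 7.0, 9.0]
v = [5.0, 1.0, 2.0, 8.0]
a, b = weighted_line(x, m, v)
print(a, b)
```

Passing the sample frequencies as the weights gives the ordinary least squares line; passing the v of the text gives the line for which χ² is a minimum, under the assumption stated above that v does not depend on the constants.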
(7) Illustration VI. Auricular Height of School Girls. 


This example was first used by Pearson in the memoir on skew correlation*
and later as an illustration of the test of ‘goodness of fit’ of regression curves†.








* Drapers’ Company Research Memoirs, Biometric Series II. p. 34.
† Biometrika, Vol. XI. p. 253.
For the present use theoretical values of $\bar n_p$ and $\sigma_{n_p}^2$ were determined, from which the
values of $v_p$ given in Table X are calculated. The $n_p$ and $v_p$ of the table represent the
weights given to the means of arrays respectively by the method of least squares and
by our method of making χ² a minimum. It will be seen that our method throws
the weight more to the first half part of the groups of ages than the method of
least squares. This is due to the heteroscedasticity of the material, the $\sigma_{n_p}^2$
varying from 27.2776 in the youngest group to 60.4676 in the eldest. The two
last columns of Table X contain the $m_p$ calculated from our regression formula
and from the usual formula; as might be expected our $m_p$’s are closer to the
means of the observations for the younger groups of ages and differ more for the
higher ages than do the $m_p$ values obtained by the method of least squares. The
χ² calculated by (10) are for the two cases 18.45 and 18.67 and we have only raised
the ‘goodness of fit’ P from .543 to .558 although the weighting in the two
methods appeared sensibly different.

TABLE X.

Age | n_p observed | v_p | m_p observed | m_p from χ² a minimum | m_p from least squares
3—4 | 1 | 5.3790 | 115.25 | 117.76 | 117.95
4—5 | 7 | 13.7170 | 116.96 | 118.44 | 118.61
5—6 | 18 | 28.5973 | 117.47 | 119.13 | 119.27
6—7 | 40 | 56.0527 | 119.10 | 119.81 | 119.94
7—8 | 76 | 95.3828 | 120.30 | 120.49 | 120.60
8—9 | 125 | 146.023 | 121.63 | 121.17 | 121.26
9—10 | 177 | 199.783 | 121.72 | 121.86 | 121.92
10—11 | 235 | 243.414 | 122.82 | 122.54 | 122.59
11—12 | 261 | 271.704 | 123.14 | 123.22 | 123.25
12—13 | 309 | 277.232 | 123.89 | 123.90 | 123.91
13—14 | 262 | 259.386 | 124.86 | 124.59 | 124.58
14—15 | 198 | 223.505 | 125.71 | 125.27 | 125.24
15—16 | 214 | 172.851 | 126.16 | 125.95 | 125.90
16—17 | 162 | 121.965 | 126.53 | 126.63 | 126.57
17—18 | 95 | 75.7303 | 126.91 | 127.32 | 127.23
18—19 | 61 | 43.0926 | 127.02 | 128.00 | 127.89
19—20 | 13 | 21.2448 | 129.56 | 128.68 | 128.55
20—21 | (illegible) | 8.09110 | 123.82 | 129.36 | 129.22
21—22 | 8 | 6.42326 | 126.50 | 130.05 | 129.88
22—23 | 2 | 2.42653 | 125.25 | 130.73 | 130.54





The usual regression line is
$$m_p = 124.0467 + .662979\,(x_p - 12.7007),$$
124.0467 and 12.7007 being the general means, and the regression line from the χ²
formula may be written
$$m_p = 124.0411 + .682455\,(x_p - 12.7007),$$
from which it is seen that it passes not far from the mean.

In a similar way I have treated the regression of ages on height of head. Also
I have here calculated the heteroscedasticity and have had to use a parabola to
represent $\sigma_{n_p'}^2$, the squared standard deviation of the arrays of same height, to
obtain a reasonable description; this is shown on the diagram. The marginal
frequencies of the height variate could be expressed fairly well by a Gaussian curve.

Diagram I. Comparison of Regression Straight Lines found by method of
Least Squares and by χ² Test.

These theoretical values of $\sigma_{n_p'}^2$ and $\bar n_p'$ are given in Table XI together with the
weights $v_p'$ calculated from them.











TABLE XI.

Height of head (millims) | σ² of array, theoretical | n_p' theoretical | weight v_p' | n_p' observed | m_p' observed | m_p' from χ² a minimum | m_p' from least squares
102.25—104.25 | 8.456 | 4.73 | 4.7123 | 2 | 5.00 | 9.92 | 9.99
104.25—106.25 | 8.748 | 6.62 | 6.3809 | 10 | 10.40 | 10.18 | 10.25
106.25—108.25 | 8.987 | 13.89 | 13.0282 | 10 | 11.10 | 10.45 | 10.51
108.25—110.25 | 9.172 | 26.80 | 24.6339 | 27 | 11.54 | 10.72 | 10.77
110.25—112.25 | 9.304 | 47.59 | 43.1217 | 56 | 11.71 | 10.99 | 11.03
112.25—114.25 | 9.382 | 77.76 | 69.8648 | 59 | 11.81 | 11.26 | 11.29
114.25—116.25 | 9.408 | 116.90 | 104.750 | 115 | 11.62 | 11.53 | 11.55
116.25—118.25 | 9.380 | 161.71 | 145.332 | 142 | 11.70 | 11.80 | 11.81
118.25—120.25 | 9.298 | 205.33 | 186.597 | 244 | 11.80 | 12.06 | 12.08
120.25—122.25 | 9.164 | 241.06 | 221.744 | 265 | 12.15 | 12.33 | 12.34
122.25—124.25 | 8.976 | 259.78 | 243.960 | 261 | 12.52 | 12.60 | 12.60
124.25—126.25 | 8.735 | 257.59 | 248.580 | 265 | 12.83 | 12.87 | 12.86
126.25—128.25 | 8.441 | 235.02 | 234.710 | 219 | 12.98 | 13.14 | 13.12
128.25—130.25 | 8.093 | 197.30 | 205.508 | 197 | 13.78 | 13.41 | 13.38
130.25—132.25 | 7.692 | 152.41 | 167.023 | 131 | 13.85 | 13.67 | 13.64
132.25—134.25 | 7.238 | 108.33 | 126.167 | 88 | 13.78 | 13.94 | 13.90
134.25—136.25 | 6.730 | 70.85 | 88.7361 | 77 | 14.28 | 14.21 | 14.16
136.25—138.25 | 6.170 | 42.64 | 58.2529 | 52 | 14.40 | 14.48 | 14.42
138.25—140.25 | 5.556 | 23.61 | 35.8204 | 20 | 14.05 | 14.75 | 14.69
140.25—142.25 | 4.888 | 12.03 | 20.7416 | 16 | 14.56 | 15.02 | 14.95
142.25—144.25 | 4.168 | 5.64 | 11.4040 | 11 | 14.95 | 15.29 | 15.21
144.25—146.25 | 3.394 | 2.43 | 6.0407 | 4 | 18.00 | 15.55 | 15.47
146.25—148.25 | 2.567 | 1.49 | 4.8835 | 1 | 19.50 | 15.82 | 15.73











The usual regression line is
$$m_p' = 12.7007 + .130489\,(y_p - 124.0467),$$
and the line for which χ² is a minimum is
$$m_p' = 12.7071 + .1342345\,(y_p - 124.0467).$$
For χ² were found in the two cases the values 44.411 and 44.109 and for the
‘goodness of fit’ P the values .0047 and .0051*.


* A case was purposely chosen in which the regression was known to be far from linear, in order to 
ascertain whether this fact itself would separate at all widely the least square and x? regression lines. 
The intersection point of the two χ² regression lines is m = 124.0453,
m' = 12.7070, which is seen to be very near to the general means. Introducing
that point into the equations of the lines, they take the form

$$m_p' = 12.7070 + .1342345\,(y_p - 124.0453),$$
$$m_p = 124.0453 + .682455\,(x_p - 12.7070).$$

From the slopes of the lines we find the value .3027 for the correlation coefficient,
whereas the method of least squares gives the value .2941.
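
The two values of the correlation coefficient follow from the slopes quoted above, on the assumption that the coefficient is taken as the geometric mean of the two regression slopes (the usual relation for a pair of regression lines). A short check, with illustrative variable names:

```python
import math

# Slopes of height on age and of age on height, as quoted in the text.
slopes_chi_square = (0.682455, 0.1342345)
slopes_least_squares = (0.662979, 0.130489)

r_chi_square = math.sqrt(slopes_chi_square[0] * slopes_chi_square[1])
r_least_squares = math.sqrt(slopes_least_squares[0] * slopes_least_squares[1])
print(round(r_chi_square, 4), round(r_least_squares, 4))
```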


Although we have found the material to be decidedly heteroscedastic and the 
weighting of the two series of means rather different from that of the marginal 
frequencies, we nevertheless see that the resulting regression lines differ very little 
from the ordinary regression lines, both the deviations of the means and the 
correlation coefficient derived from them being less than their probable errors. 


(8) The conclusions to be drawn from the present investigation are: 


(i) The definition of ‘best,’ which leads to the method of moments being con- 
sidered ‘best’ and incidentally to the method of least squares being considered 
‘best,’ is undoubtedly somewhat arbitrary. If we use Pearson’s ‘Goodness of Fit’ 
test, then the method of moments is not necessarily the ‘best,’ the best value of 
the constant termed the mean is not necessarily the mean, nor generally the best 
value of the correlation coefficient between two variates that calculated by the 
moments and product moment method. 


(ii) On the other hand the present numerical illustrations appear to indicate 
that but little practical advantage is gained by a great deal of additional labour, 
the values of P are only slightly raised—probably always within their range of 
probable error. In other words the investigation justifies the method of moments 
as giving excellent values of the constants with nearly the maximum value of P or 
it justifies the use of the method of moments, if the definition of ‘best’ by which 
that method is reached must at least be considered somewhat arbitrary. 


The present paper was worked out in the Biometric Laboratory and I have 
to thank Professor Pearson for his aid throughout the work. 
Note on the Standard Deviations of Samples of Two or Three.


By ANDREW W. YOUNG, M.A. 


In an “ Editorial” contained in Vol. x of Biometrika*, there is a discussion of the distribution 
of the values of the standard deviation of a population which are deduced from small samples of 
the population. It is there shown how the distribution approaches normality as the number, n, 
in the sample increases, a table of the characteristic constants of the frequency curves for various 
values of n being given. The smallest sample considered is that of n = 4, but samples of two and 
three are of occasional occurrence especially in physical work and now and again a value of the 
probable error of an experimental result is deduced from a set of two or of three observations.
A knowledge of the theoretical distribution of the standard deviations for such small samples 
will give us some idea of the reliability of this procedure and it is the object of this note to supply 
this omission from the former paper. 

“Student’s” formula for the distribution of the standard deviations of samples is

$$y = y_0\,\Sigma^{n-2}\,e^{-\frac{n\Sigma^2}{2\sigma^2}},$$
where σ is the standard deviation of the whole population and Σ is the standard deviation given
by a sample of size n. Thus the distribution for samples of two is
$$y = y_0\,e^{-\frac{\Sigma^2}{\sigma^2}},$$
extending from Σ = 0 to Σ = ∞, i.e. the distribution is simply half of a normal curve; and the
distribution for samples of three is
$$y = y_0\,\Sigma\,e^{-\frac{3\Sigma^2}{2\sigma^2}},$$
also, of course, extending from Σ = 0 to Σ = ∞.


It is easy to find by direct integration the moment coefficients of these curves. 


Case of Samples of Two. 
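If we denote by N the total number of samples in the assumed distribution,

$$N = y_0\int_0^\infty e^{-\frac{\Sigma^2}{\sigma^2}}\,d\Sigma = y_0\,\frac{\sigma\sqrt{\pi}}{2}.$$

Taking the first four moment coefficients about the origin to be $\mu_1', \mu_2', \mu_3', \mu_4'$, and the moment
coefficients about the mean to be, as usual, $\mu_2, \mu_3, \mu_4$, we have

$$\mu_1' = \text{Mean value of } \Sigma = \bar\Sigma = \frac{y_0}{N}\int_0^\infty \Sigma\,e^{-\frac{\Sigma^2}{\sigma^2}}\,d\Sigma = \frac{y_0}{N}\,\frac{\sigma^2}{2} = \frac{\sigma}{\sqrt{\pi}} = .5642\sigma.$$
$$\mu_2' = \frac{y_0}{N}\int_0^\infty \Sigma^2\,e^{-\frac{\Sigma^2}{\sigma^2}}\,d\Sigma = \frac{y_0}{N}\,\frac{\sigma^3\sqrt{\pi}}{4} = \frac{\sigma^2}{2},$$
giving: $\mu_2 = \sigma^2\left(\frac{1}{2} - \frac{1}{\pi}\right) = .1817\sigma^2$,

and the standard deviation of Σ = $\sigma_\Sigma$ = .4263σ.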


* “On the Distribution of the Standard Deviations of Small Samples.” Appendix I to papers by 
“Student” and R. A. Fisher. Biometrika, Vol. x. p. 522, 1915. 
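$$\mu_3' = \frac{y_0}{N}\int_0^\infty \Sigma^3\,e^{-\frac{\Sigma^2}{\sigma^2}}\,d\Sigma = \frac{y_0}{N}\,\frac{\sigma^4}{2} = \frac{\sigma^3}{\sqrt{\pi}},$$
giving: $\mu_3 = \frac{\sigma^3}{\sqrt{\pi}}\left(\frac{2}{\pi} - \frac{1}{2}\right) = .0771\sigma^3$.
$$\mu_4' = \frac{y_0}{N}\int_0^\infty \Sigma^4\,e^{-\frac{\Sigma^2}{\sigma^2}}\,d\Sigma = \frac{y_0}{N}\,\frac{3\sigma^5\sqrt{\pi}}{8} = \frac{3}{4}\sigma^4,$$
giving: $\mu_4 = \sigma^4\left(\frac{3}{4} - \frac{1}{\pi} - \frac{3}{\pi^2}\right) = .1277\sigma^4$.

From these we derive
β₁ = .9906, β₂ = 3.8692.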


Case of Samples of Three. 


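$$\mu_1' = \bar\Sigma = \sigma\sqrt{\frac{\pi}{6}} = .7236\sigma.$$

In the same way
$$\mu_2' = \frac{2\sigma^2}{3},$$
giving: $\mu_2 = \sigma^2\left(\frac{2}{3} - \frac{\pi}{6}\right) = .1431\sigma^2$,
and $\sigma_\Sigma$ = .3782σ.
$$\mu_3' = \sigma^3\sqrt{\frac{\pi}{6}},$$
giving: $\mu_3 = \sigma^3\sqrt{\frac{\pi}{6}}\left(\frac{\pi}{3} - 1\right) = .0342\sigma^3$.
$$\mu_4' = \frac{8}{9}\sigma^4,$$
giving: $\mu_4 = \sigma^4\left(\frac{8}{9} - \frac{\pi^2}{12}\right) = .0664\sigma^4$,
and β₁ = .3983, β₂ = 3.2451.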


Modal Values. 

In the case of n = 2, the mode of the theoretical curve is at the origin Σ = 0, but it is to be
borne in mind that it is the areas of strips of the frequency-curve which are to be used to estimate
the probability. In practice, therefore, seeing that all measurements must be made in discrete
amounts and cannot be mathematically continuous, we can only assert that the most frequently
occurring values of Σ are those which are nearest zero—not actually zero. The example given
below will make this clear.


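For n = 3, the mode is obtained by differentiating the equation
$$y = y_0\,\Sigma\,e^{-\frac{3\Sigma^2}{2\sigma^2}}.$$
This gives, for the mode,
$$1 - \frac{3\Sigma^2}{\sigma^2} = 0;$$
thus the modal value $\check\Sigma$ of Σ is given by
$$\check\Sigma = \frac{\sigma}{\sqrt{3}} = .5774\sigma.$$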

Skewness. 
The skewness of the distributions is for n = 2, 1.3236, and for n = 3, .3867.
Thus whether we consider the mean or the modal values of the distributions, it is evident that 
the probable error determined from a set of three observations is very untrustworthy and that when 
there are only two observations it is very much worse. 
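
The constants found above can be checked numerically. The sketch below assumes “Student’s” curve in the form y = y₀ Σ^(n−2) e^(−nΣ²/2σ²), for which the r-th moment about the origin is (2σ²/n)^(r/2) Γ((n−1+r)/2) / Γ((n−1)/2), and takes the skewness as (mean − mode)/(standard deviation), which is the definition that reproduces the values quoted in this note. The function name is illustrative.

```python
import math


def sd_distribution_constants(n, sigma=1.0):
    """Constants of the distribution of the standard deviation of samples of n."""
    scale = 2.0 * sigma ** 2 / n

    def raw(r):
        # r-th moment about the origin of y = y0 * S**(n-2) * exp(-n S^2 / (2 sigma^2))
        return scale ** (r / 2) * math.gamma((n - 1 + r) / 2) / math.gamma((n - 1) / 2)

    m1, m2, m3, m4 = raw(1), raw(2), raw(3), raw(4)
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    sd = math.sqrt(mu2)
    mode = sigma * math.sqrt((n - 2) / n)  # zero of the derivative of the curve
    return {
        "mode": mode,
        "mean": m1,
        "sd": sd,
        "skewness": (m1 - mode) / sd,
        "beta1": mu3 ** 2 / mu2 ** 3,
        "beta2": mu4 / mu2 ** 2,
    }


two = sd_distribution_constants(2)
three = sd_distribution_constants(3)
print(two)
print(three)
```

With σ = 1 the results are in units of the population standard deviation, as in the table that follows.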













With the preceding calculations we can now complete the Table on p. 529 of Biometrika, 
Vol. x, with the following 


TABLE I. 


Table of Values of the Constants of the Frequency Distributions of the Standard 
Deviations of Samples of Size 2 and 3 drawn at random from a Normal 
Population. 


Size of sample | Mode Σ̌/σ | Mean Σ̄/σ | Standard Deviation σ_Σ/σ | Standard Deviation σ_Σ/(σ/√(2n)) | Skewness | β₁ | β₂
2 | 0 | .5642 | .4263 | .8525 | 1.3236 | .9906 | 3.8692
3 | .5774 | .7236 | .3782 | .9265 | .3867 | .3983 | 3.2451

(The last three columns are the Measures of Deviation from Normality.)





Experimental Verification for the Case of Samples of Two. 
The frequency distribution of the number of stigmatic bands on the capsules of a growth of 
Shirley poppies is as follows*: 


TABLE II.

No. of Stigmatic Bands | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | Total
No. of Capsules | 1 | 11 | 32 | 56 | 148 | 363 | 628 | 925 | 954 | 709 | 397 | 155 | 51 | 12 | 1 | 4443





the standard deviation of the distribution, as calculated by the ordinary method, being 1-8977. 


We will examine the frequency distribution of the values of the standard deviation of this
series which are given by samples of two capsules taken at random. Now the standard deviation
of two measurements $x_1$ and $x_2$ is easily shown to be $\frac{x_1 - x_2}{2}$, taken with the plus sign, so that
in this case, the variate being measurable only in units, the possible standard deviations will
be 0, .5, 1, 1.5, ..., and we can find the theoretical frequencies (column (b) of Table III) of these
values by taking the areas of the strips of the half-Gaussian
$$y = \frac{2N}{1.8977\sqrt{\pi}}\,e^{-\frac{\Sigma^2}{(1.8977)^2}},$$
whose bounding ordinates cut the axis of Σ at 0, .25; .25, .75; .75, 1.25; ..... Thus the breadth
of the first strip is only half that of the others and it will be found that it is the value .5 which is
the real mode of the probability.

In the case of such a distribution we can find the frequencies of the standard deviations of
samples of two by actual calculation. For if we denote the chance of a capsule occurring with
5, 6, 7, 8, 9, ... stigmatic bands by a, b, c, d, e, ... respectively, it is clear that the chance of the
value 0 of the standard deviation occurring is a² + b² + c² + d² + ..., and the chance of the value .5
occurring is the chance of a sample of two flowers whose numbers of stigmatic bands differ by 1,
i.e. is

ab + b(a + c) + c(b + d) + ... = 2(ab + bc + cd + ...),
and similarly for Σ = 1 the chance is
ac + bd + c(a + e) + d(b + f) + ... = 2(ac + bd + ce + ...),
and so on.

By this means the theoretical frequencies given in column (a) of Table III were calculated. A 

histogram of these values will be found to be well in accord with the half-Gaussian given above. 
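
The two sets of theoretical chances can be recomputed from Table II. The sketch below assumes the two capsules of a sample are drawn independently (as the products a², ab, ... imply) and that the half-Gaussian strip areas are expressed as fractions of the total, so that the area from 0 to t is erf(t/σ). The function names are illustrative.

```python
import math

bands = list(range(5, 20))
capsules = [1, 11, 32, 56, 148, 363, 628, 925, 954, 709, 397, 155, 51, 12, 1]
total = sum(capsules)
probs = [c / total for c in capsules]
mean_bands = sum(b * p for b, p in zip(bands, probs))
sigma = math.sqrt(sum((b - mean_bands) ** 2 * p for b, p in zip(bands, probs)))


def exact_chances(probs):
    """chances[d] = chance that the two counts differ by d, i.e. standard deviation d/2."""
    k = len(probs)
    chances = []
    for d in range(k):
        s = sum(probs[i] * probs[i + d] for i in range(k - d))
        chances.append(s if d == 0 else 2 * s)
    return chances


def half_gaussian_chances(sigma, count):
    """Areas of the strips 0 to .25, .25 to .75, .75 to 1.25, ... of the half-Gaussian."""
    chances = []
    for d in range(count):
        lower = max(d / 2 - 0.25, 0.0)
        upper = d / 2 + 0.25
        chances.append(math.erf(upper / sigma) - math.erf(lower / sigma))
    return chances


chance_a = exact_chances(probs)                          # by actual calculation
chance_b = half_gaussian_chances(sigma, len(chance_a))   # from the half-Gaussian
mean_a = sum(d / 2 * c for d, c in enumerate(chance_a))
mean_half_gaussian = sigma / math.sqrt(math.pi)
print(round(sigma, 4), round(mean_a, 3), round(mean_half_gaussian, 3))
```

In both sets the largest chance falls at the value .5 rather than at 0, in agreement with the remark above on the real mode.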


* Pearson, Phil. Trans. Vol. 197 A, p. 314, Hampden Series. 
With a view to confirming the theoretical results, a practical random sampling was made in
the following way. A circle, about 20 inches in diameter, was carefully divided into sectors
whose angles (and therefore areas) were proportional to the class-frequencies given in Table II.
For each sampling a pointer was placed at random inside the circle and the number of the sector
in which it was placed was noted. It appeared that, if the observer kept his eyes shut and
rotated the circle for a time between each pointing, the method would give quite satisfactorily
random sampling. In this way an ordered series of 648 samples was made and by taking samples
of two by pairing in accordance with three different rules, the series of numbers given in Table III
were obtained. Alongside of each series in the Table is given the ratio of the number in each
set to the total. These ratios are to be compared with the “theoretical chances.” In all three
samples the number with standard deviation 0 is considerably less than the theoretical numbers,
but from the values of the Goodness of Fit “P” given at the ends of the columns it appears that this
can be accounted for by the variations of random sampling.


TABLE III. 


Frequency distribution of Standard Deviations of Samples of Two 
Capsules from the Shirley Poppy Series. 





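Columns: Standard Deviation | Theoretical Chance (a) | Theoretical Chance (b) | Experimental Results I, II, III (for each series the number of samples and its ratio to the total). The last two rows give P using Theoretical Chance (a) and using Theoretical Chance (b).

Standard Deviation: 0, .5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0.

Only one column of ratios is legible (the series to which it belongs is not identified): .1172, .3056, .2315, .1790, .0802, .0463, .0154, .0154, .0062, .0031.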











The mean standard deviations given by these sets are for I, 1-06, for II, 1-07, for III, 1-05, 
while those derived from the half-Gaussian and from the “theoretical chances” are 1-071 and 
1-050 respectively. It will be noticed also that although the calculated chances (a) appear to 
give a better fit than the half-Gaussian, the differences in the “P’s” are very slight, and the
half-Gaussian would suffice for most purposes. These results form a strong confirmation of the
correctness of the theory.